The automated categorization (or classification) of texts into predefined categories has witnessed a
booming interest in the last ten years, due to the increased availability of documents in digital form 
and the ensuing need to organize them. In the research community the dominant approach to this 
problem is based on machine learning techniques: a general inductive process automatically builds 
a classifier by learning, from a set of preclassified documents, the characteristics of the categories. 
The advantages of this approach over the knowledge engineering approach (consisting in the 
manual definition of a classifier by domain experts) are a very good effectiveness, considerable 
savings in terms of expert manpower, and straightforward portability to different domains. This 
survey discusses the main approaches to text categorization that fall within the machine learning 
paradigm. We will discuss in detail issues pertaining to three different problems, namely document 
representation, classifier construction, and classifier evaluation. 

Categories and Subject Descriptors: H.3.1 [Information storage and retrieval]: Content
analysis and indexing — Indexing methods; H.3.3 [Information storage and retrieval]:
Information search and retrieval — Information filtering; H.3.3 [Information storage and retrieval]:
Systems and software — Performance evaluation (efficiency and effectiveness); I.2.3 [Artificial
Intelligence]: Learning — Induction

General Terms: Algorithms, Experimentation, Theory 
1. INTRODUCTION

In the last ten years content-based document management tasks (collectively known 
as information retrieval - IR) have gained a prominent status in the information 
systems field, due to the increased availability of documents in digital form and 
the ensuing need to access them in flexible ways. Text categorization (TC - aka 
text classification, or topic spotting), the activity of labelling natural language texts 
with thematic categories from a predefined set, is one such task. TC dates back 
to the early '60s, but only in the early '90s did it become a major subfield of the
information systems discipline, thanks to increased applicative interest and to the 
availability of more powerful hardware. TC is now being applied in many contexts, 
ranging from document indexing based on a controlled vocabulary, to document 
filtering, automated metadata generation, word sense disambiguation, population 
of hierarchical catalogues of Web resources, and in general any application requiring 
document organization or selective and adaptive document dispatching. 

Until the late '80s the most popular approach to TC, at least in the "opera- 
tional" (i.e. real-world applications) community, was a knowledge engineering (KE) 
one, consisting in manually defining a set of rules encoding expert knowledge on 
how to classify documents under the given categories. In the '90s this approach has
increasingly lost popularity (especially in the research community) in favour of the
machine learning (ML) paradigm, according to which a general inductive process 
automatically builds an automatic text classifier by learning, from a set of preclas- 
sified documents, the characteristics of the categories of interest. The advantages 
of this approach are an accuracy comparable to that achieved by human experts, 
and considerable savings in terms of expert manpower, since no intervention from
either knowledge engineers or domain experts is needed for the construction of the 
classifier or for its porting to a different set of categories. It is the ML approach to 
TC that this paper concentrates on. 

Current-day TC is thus a discipline at the crossroads of ML and IR, and as such it 
shares a number of characteristics with other tasks such as information/knowledge 
extraction from texts and text mining [Knight 1999; Pazienza 1997]. There is still
considerable debate on where the exact border between these disciplines lies, and the 
terminology is still evolving. "Text mining" is increasingly being used to denote all 
the tasks that, by analyzing large quantities of text and detecting usage patterns, try 
to extract probably useful (although only probably correct) information. According 
to this view, TC is an instance of text mining. TC enjoys quite a rich literature now, 
but this is still fairly scattered^1. Although two international journals have devoted
special issues to this topic [Joachims and Sebastiani 2001; Lewis and Hayes 1994],
there are no systematic treatments of the subject: there are neither textbooks nor
journals entirely devoted to TC yet, and [Manning and Schütze 1999, Chapter 16]
is the only chapter-length treatment of the subject. As a note, we should warn the 
reader that the term "automatic text classification" has sometimes been used in 
the literature to mean things quite different from the ones discussed here. Aside 
from (i) the automatic assignment of documents to a predefined set of categories, 
which is the main topic of this paper, the term has also been used to mean (ii) the 
automatic identification of such a set of categories (e.g. [Borko and Bernick 1963]),
or (iii) the automatic identification of such a set of categories and the grouping of
documents under them (e.g. [Merkl 1998]), a task usually called text clustering, or
(iv) any activity of placing text items into groups, a task that has thus both TC
and text clustering as particular instances [Manning and Schütze 1999].

This paper is organized as follows. In Section 2 we formally define TC and its
various subcases, and in Section 3 we review its most important applications.
Section 4 describes the main ideas underlying the ML approach to classification. Our
discussion of text classification starts in Section 5 by introducing text indexing, i.e.
the transformation of textual documents into a form that can be interpreted by a
classifier-building algorithm and by the classifier eventually built by it. Section 6
tackles the inductive construction of a text classifier from a "training" set of
preclassified documents. Section 7 discusses the evaluation of text classifiers. Section 8
concludes, discussing open issues and possible avenues of further research for TC.



2. TEXT CATEGORIZATION 



^1 A fully searchable bibliography on TC created and maintained by this author is available at
http://liinwww.ira.uka.de/bibliography/Ai/automated.text.categorization.html
2.1 A definition of text categorization

Text categorization is the task of assigning a Boolean value to each pair $\langle d_j, c_i \rangle \in
\mathcal{D} \times \mathcal{C}$, where $\mathcal{D}$ is a domain of documents and $\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$ is a set of predefined
categories. A value of $T$ assigned to $\langle d_j, c_i \rangle$ indicates a decision to file $d_j$ under $c_i$,
while a value of $F$ indicates a decision not to file $d_j$ under $c_i$. More formally, the task
is to approximate the unknown target function $\breve{\Phi} : \mathcal{D} \times \mathcal{C} \to \{T, F\}$ (that describes
how documents ought to be classified) by means of a function $\Phi : \mathcal{D} \times \mathcal{C} \to \{T, F\}$
called the classifier (aka rule, or hypothesis, or model) such that $\breve{\Phi}$ and $\Phi$ "coincide
as much as possible". How to precisely define and measure this coincidence (called
effectiveness) will be discussed in Section 7.1. From now on we will assume that:



— The categories are just symbolic labels, and no additional knowledge (of a pro- 
cedural or declarative nature) of their meaning is available. 

— No exogenous knowledge (i.e. data provided for classification purposes by an ex- 
ternal source) is available; therefore, classification must be accomplished on the 
basis of endogenous knowledge only (i.e. knowledge extracted from the docu- 
ments). In particular, this means that metadata such as e.g. publication date, 
document type, publication source, etc. is not assumed to be available. 

The TC methods we will discuss are thus completely general, and do not depend on 
the availability of special-purpose resources that might be unavailable or costly to 
develop. Of course, these assumptions need not be verified in operational settings, 
where it is legitimate to use any source of information that might be available or
deemed worth developing [Díaz Esteban et al. 1998; Junker and Abecker 1997].

Relying only on endogenous knowledge means classifying a document based solely 
on its semantics, and given that the semantics of a document is a subjective notion, 
it follows that the membership of a document in a category (pretty much as the
relevance of a document to an information need in IR [Saracevic 1975]) cannot be
decided deterministically. This is exemplified by the phenomenon of inter-indexer
inconsistency [Cleverdon 1984]: when two human experts decide whether to classify
document $d_j$ under category $c_i$, they may disagree, and this in fact happens with
relatively high frequency. A news article on Clinton attending Dizzy Gillespie's 
funeral could be filed under Politics, or under Jazz, or under both, or even under 
neither, depending on the subjective judgment of the expert. 

2.2 Single-label vs. multi-label text categorization 

Different constraints may be enforced on the TC task, depending on the application. 
For instance we might need that, for a given integer $k$, exactly $k$ (or $\le k$, or $\ge k$)
elements of $\mathcal{C}$ be assigned to each $d_j \in \mathcal{D}$. The case in which exactly 1 category
must be assigned to each $d_j \in \mathcal{D}$ is often called the single-label (aka non-overlapping
categories) case, while the case in which any number of categories from 0 to $|\mathcal{C}|$
may be assigned to the same $d_j \in \mathcal{D}$ is dubbed the multi-label (aka overlapping
categories) case. A special case of single-label TC is binary TC, in which each
$d_j \in \mathcal{D}$ must be assigned either to category $c_i$ or to its complement $\bar{c}_i$.

From a theoretical point of view, the binary case (hence, the single-label case too) 
is more general than the multi-label, since an algorithm for binary classification can 
also be used for multi-label classification: one needs only transform the problem 
of multi-label classification under $\{c_1, \ldots, c_{|\mathcal{C}|}\}$ into $|\mathcal{C}|$ independent problems of
binary classification under $\{c_i, \bar{c}_i\}$, for $i = 1, \ldots, |\mathcal{C}|$. However, this requires that
categories are stochastically independent of each other, i.e. that for any $c'$, $c''$ the
value of $\breve{\Phi}(d_j, c')$ does not depend on the value of $\breve{\Phi}(d_j, c'')$ and vice versa; this
is usually assumed to be the case (applications in which this is not the case are
discussed in Section 3.5). The converse is not true: an algorithm for multi-label
classification cannot be used for either binary or single-label classification. In fact,
given a document $d_j$ to classify, (i) the classifier might attribute $k > 1$ categories to
$d_j$, and it might not be obvious how to choose a "most appropriate" category from
them; or (ii) the classifier might attribute to $d_j$ no category at all, and it might not
be obvious how to choose a "least inappropriate" category from $\mathcal{C}$.

In the rest of the paper, unless explicitly mentioned, we will deal with the binary 
case. There are various reasons for this: 

— The binary case is important in itself because important TC applications,
including filtering (see Section 3.3), consist of binary classification problems (e.g.
deciding whether $d_j$ is about Jazz or not). In TC, most binary classification
problems feature unevenly populated categories (e.g. much fewer documents are about
Jazz than are not) and unevenly characterized categories (e.g. what is about Jazz
can be characterized much better than what is not).

— Solving the binary case also means solving the multi-label case, which is also
representative of important TC applications, including automated indexing for
Boolean systems (see Section 3.1).

— Most of the TC literature is couched in terms of the binary case. 

— Most techniques for binary classification are just special cases of existing tech- 
niques for the single-label case, and are simpler to illustrate than these latter. 

This ultimately means that we will view classification under $\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$ as
consisting of $|\mathcal{C}|$ independent problems of classifying the documents in $\mathcal{D}$ under a
given category $c_i$, for $i = 1, \ldots, |\mathcal{C}|$. A classifier for $c_i$ is then a function $\Phi_i : \mathcal{D} \to
\{T, F\}$ that approximates an unknown target function $\breve{\Phi}_i : \mathcal{D} \to \{T, F\}$.
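
The reduction of multi-label classification to independent binary problems can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword-matching functions stand in for whatever learner would actually induce each binary classifier, and the names `make_keyword_classifier` and `classify_multilabel` are hypothetical.

```python
from typing import Callable, Dict, Set

# A binary classifier for one category c_i: document text -> True (T) or False (F).
BinaryClassifier = Callable[[str], bool]


def make_keyword_classifier(keywords: Set[str]) -> BinaryClassifier:
    """Toy stand-in for a learned classifier: fires if any keyword occurs."""
    def classify(document: str) -> bool:
        tokens = set(document.lower().split())
        return bool(tokens & keywords)
    return classify


def classify_multilabel(document: str,
                        classifiers: Dict[str, BinaryClassifier]) -> Set[str]:
    """Take the |C| binary decisions independently and collect the categories
    whose classifier answered True. The result may be empty or contain several labels."""
    return {category for category, phi_i in classifiers.items() if phi_i(document)}


# Hypothetical categories and keyword lists, chosen only for the example.
classifiers = {
    "Politics": make_keyword_classifier({"president", "senate", "election"}),
    "Jazz": make_keyword_classifier({"jazz", "trumpet", "bebop"}),
}

labels = classify_multilabel(
    "The president attended the funeral of a jazz trumpet legend", classifiers)
print(sorted(labels))
```

Because each decision is taken in isolation, nothing forces exactly one label per document, which is why the same scheme does not directly yield a single-label classifier.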

2.3 Category-pivoted vs. document-pivoted text categorization 

There are two different ways of using a text classifier. Given $d_j \in \mathcal{D}$, we might want
to find all the $c_i \in \mathcal{C}$ under which it should be filed (document-pivoted categorization
- DPC); alternatively, given $c_i \in \mathcal{C}$, we might want to find all the $d_j \in \mathcal{D}$ that should
be filed under it (category-pivoted categorization - CPC). This distinction is more
pragmatic than conceptual, but is important since the sets $\mathcal{C}$ and $\mathcal{D}$ might not be
available in their entirety right from the start. It is also relevant to the choice of
the classifier-building method, as some of these methods (see e.g. Section 6.9) allow
the construction of classifiers with a definite slant towards one or the other style.

DPC is thus suitable when documents become available at different moments in
time, e.g. in filtering e-mail. CPC is instead suitable when (i) a new category $c_{|\mathcal{C}|+1}$
may be added to an existing set $\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$ after a number of documents have
already been classified under $\mathcal{C}$, and (ii) these documents need to be reconsidered
for classification under $c_{|\mathcal{C}|+1}$ (e.g. [Larkey 1999]). DPC is used more often than
CPC, as the former situation is more common than the latter.
Although some specific techniques apply to one style and not to the other (e.g. the
proportional thresholding method discussed in Section 3.1 applies only to CPC), 
this is more the exception than the rule: most of the techniques we will discuss 
allow the construction of classifiers capable of working in either mode. 
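
The two usage modes can be illustrated with a short Python sketch. It assumes a single decision function over (document, category) pairs; the `toy_classifier` below, which simply checks whether the category name occurs in the text, is a hypothetical placeholder and not a method from the literature.

```python
from typing import Callable, List, Sequence

# Decision function over (document, category) pairs, returning T/F.
Classifier = Callable[[str, str], bool]


def toy_classifier(document: str, category: str) -> bool:
    """Hypothetical stand-in: accept if the category name occurs in the text."""
    return category.lower() in document.lower()


def document_pivoted(document: str, categories: Sequence[str],
                     phi: Classifier) -> List[str]:
    """DPC: given one document, find all categories it should be filed under."""
    return [c for c in categories if phi(document, c)]


def category_pivoted(category: str, documents: Sequence[str],
                     phi: Classifier) -> List[str]:
    """CPC: given one category, find all documents that should be filed under it."""
    return [d for d in documents if phi(d, category)]


documents = [
    "Jazz festival opens",
    "Politics and jazz in New Orleans",
    "Budget vote delayed",
]
categories = ["Jazz", "Politics"]

# DPC: a new document arrives and is classified under the existing categories.
print(document_pivoted(documents[1], categories, toy_classifier))

# CPC: a new category is added and the existing documents are reconsidered for it.
print(category_pivoted("Budget", documents, toy_classifier))
```

Both functions call the same underlying decision function; only the loop differs, which mirrors the remark that most techniques can work in either mode.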

2.4 "Hard" categorization vs. ranking categorization 

While a complete automation of the TC task requires a $T$ or $F$ decision for each
pair $\langle d_j, c_i \rangle$, a partial automation of this process might have different requirements.
For instance, given $d_j \in \mathcal{D}$ a system might simply rank the categories in $\mathcal{C} =
\{c_1, \ldots, c_{|\mathcal{C}|}\}$ according to their estimated appropriateness to $d_j$, without taking
any "hard" decision on any of them. Such a ranked list would be of great help to a
human expert in charge of taking the final categorization decision, since she could
thus restrict the choice to the category (or categories) at the top of the list, rather
than having to examine the entire set. Alternatively, given $c_i \in \mathcal{C}$ a system might
simply rank the documents in $\mathcal{D}$ according to their estimated appropriateness to $c_i$;
symmetrically, for classification under $c_i$ a human expert would just examine the
top-ranked documents rather than the entire document set. These two modalities
are sometimes called category-ranking TC and document-ranking TC [Yang 1999],
respectively, and are the obvious counterparts of DPC and CPC.
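
The contrast between ranking and "hard" categorization can be sketched as follows. The scores are invented for the example, and turning them into T/F decisions with a fixed threshold is just one simple rule assumed here for illustration.

```python
from typing import Dict, List, Tuple


def rank_categories(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Category-ranking TC: order categories by decreasing estimated appropriateness."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def hard_decisions(scores: Dict[str, float], threshold: float) -> Dict[str, bool]:
    """'Hard' TC: one T/F decision per category (assumed rule: score >= threshold)."""
    return {category: score >= threshold for category, score in scores.items()}


# Hypothetical scores assigned by some classifier to a single document d_j.
scores_for_dj = {"Politics": 0.62, "Jazz": 0.71, "Sports": 0.05}

ranking = rank_categories(scores_for_dj)
decisions = hard_decisions(scores_for_dj, threshold=0.5)

print(ranking)
print(decisions)
```

In the ranking mode the final choice is left to the human expert, who only needs to inspect the top of the list; in the hard mode the system commits to a decision for every category.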



Semi-automated, "interactive" classification systems [Larkey and Croft 1996] are
useful especially in critical applications in which the effectiveness of a fully
automated system may be expected to be significantly lower than that of a human
expert. This may be the case when the quality of the training data (see Section 4)
is low, or when the training documents cannot be trusted to be a representative
sample of the unseen documents that are to come, so that the results of a completely
automatic classifier could not be trusted completely.

In the rest of the paper, unless explicitly mentioned, we will deal with "hard" 
classification; however, many of the algorithms we will discuss naturally lend
themselves to ranking TC too (more details on this in Section 6.1).

3. APPLICATIONS OF TEXT CATEGORIZATION 



TC goes back to Maron's [1961] seminal work on probabilistic text classification.
Since then, it has been used for a number of different applications, of which we here
briefly review the most important ones. Note that the borders between the different
classes of applications listed here are fuzzy and somewhat artificial, and some of these
may be considered special cases of others. Other applications we do not explicitly
discuss are speech categorization by means of a combination of speech recognition
and TC [Myers et al. 2000; Schapire and Singer 2000], multimedia document
categorization through the analysis of textual captions [Sable and Hatzivassiloglou 2000],
author identification for literary texts of unknown or disputed authorship [Forsyth
1999], language identification for texts of unknown language [Cavnar and Trenkle
1994], automated identification of text genre [Kessler et al. 1997], and automated
essay grading [Larkey 1998].



3.1 Automatic indexing for Boolean information retrieval systems 



The application that has spawned most of the early research in the field [Borko and
Bernick 1963; Field 1975; Gray and Harley 1971; Heaps 1973; Maron 1961] is that
of automatic document indexing for IR systems relying on a controlled dictionary,
the most prominent example of which is that of Boolean systems. In these latter
each document is assigned one or more keywords or keyphrases describing its
content, where these keywords and keyphrases belong to a finite set called controlled
dictionary, often consisting of a thematic hierarchical thesaurus (e.g. the NASA
thesaurus for the aerospace discipline, or the MESH thesaurus for medicine). Usually,
this assignment is done by trained human indexers, and is thus a costly activity.

If the entries in the controlled vocabulary are viewed as categories, text indexing
is an instance of TC, and may thus be addressed by the automatic techniques
described in this paper. Recalling Section 2.2, note that this application may typically
require that $k_1 \le x \le k_2$ keywords are assigned to each document, for given $k_1$, $k_2$.
Document-pivoted TC is probably the best option, so that new documents may be
classified as they become available. Various text classifiers explicitly conceived for
document indexing have been described in the literature; see e.g. [Fuhr and Knorz
1984; Robertson and Harding 1984; Tzeras and Hartmann 1993].



Automatic indexing with controlled dictionaries is closely related to automated 
metadata generation. In digital libraries one is usually interested in tagging doc- 
uments by metadata that describe them under a variety of aspects (e.g. creation 
date, document type or format, availability, etc.). Some of these metadata are 
thematic, i.e. their role is to describe the semantics of the document by means of 
bibliographic codes, keywords or keyphrases. The generation of these metadata 
may thus be viewed as a problem of document indexing with controlled dictionary, 
and thus tackled by means of TC techniques. 

3.2 Document organization 

Indexing with a controlled vocabulary is an instance of the general problem of doc- 
ument base organization. In general, many other issues pertaining to document 
organization and filing, be it for purposes of personal organization or structuring of 
a corporate document base, may be addressed by TC techniques. For instance, at 
the offices of a newspaper incoming "classified" ads must be, prior to publication, 
categorized under categories such as Personals, Cars for Sale, Real Estate, 
etc. Newspapers dealing with a high volume of classified ads would benefit from 
an automatic system that chooses the most suitable category for a given ad. Other 
possible applications are the organization of patents into categories for making their
search easier [Larkey 1999], the automatic filing of newspaper articles under the
appropriate sections (e.g. Politics, Home News, Lifestyles, etc.), or the automatic
grouping of conference papers into sessions. 

3.3 Text filtering 

Text filtering is the activity of classifying a stream of incoming documents
dispatched in an asynchronous way by an information producer to an information
consumer [Belkin and Croft 1992]. A typical case is a newsfeed, where the
producer is a news agency and the consumer is a newspaper [Hayes et al. 1990]. In
this case the filtering system should block the delivery of the documents the con- 
sumer is likely not interested in (e.g. all news not concerning sports, in the case 
of a sports newspaper). Filtering can be seen as a case of single-label TC, i.e. the 
classification of incoming documents in two disjoint categories, the relevant and the 
irrelevant. Additionally, a filtering system may also further classify the documents
deemed relevant to the consumer into thematic categories; in the example above, 
all articles about sports should be further classified according e.g. to which sport 
they deal with, so as to allow journalists specialized in individual sports to access 
only documents of prospective interest for them. Similarly, an e-mail filter might
be trained to discard "junk" mail [Androutsopoulos et al. 2000; Drucker et al. 1999]
and further classify non-junk mail into topical categories of interest to the user. 

A filtering system may be installed at the producer end, in which case it must 
route the documents to the interested consumers only, or at the consumer end, in 
which case it must block the delivery of documents deemed uninteresting to the 
consumer. In the former case the system builds and updates a "profile" for each 
consumer [Liddy et al. 1994], while in the latter case (which is the more common,
and to which we will refer in the rest of this section) a single profile is needed. 

A profile may be initially specified by the user, thereby resembling a standing 
IR query, and is updated by the system by using feedback information provided 
(either implicitly or explicitly) by the user on the relevance or non-relevance of the 
delivered messages. In the TREC community [Lewis 1995c] this is called adaptive
filtering, while the case in which no user-specified profile is available is called either
routing or batch filtering, depending on whether documents have to be ranked in
decreasing order of estimated relevance or just accepted/rejected. Batch filtering
thus coincides with single-label TC under $|\mathcal{C}| = 2$ categories; since this latter is
a completely general TC task, some authors [Hull 1994; Hull et al. 1996; Schapire
et al. 1998; Schütze et al. 1995], somewhat confusingly, use the term "filtering" in
place of the more appropriate term "categorization".

In information science document filtering has a tradition dating back to the 
'60s, when, addressed by systems of various degrees of automation and dealing 
with the multi-consumer case discussed above, it was called selective dissemination 
of information or current awareness (see e.g. [Korfhage 1997, Chapter 6]). The 
explosion in the availability of digital information has boosted the importance of 
such systems, which are nowadays being used in contexts such as the creation of 
personalized Web newspapers, junk e-mail blocking, and Usenet news selection. 

Information filtering by ML techniques is widely discussed in the literature: see
e.g. [Amati and Crestani 1999; Iyer et al. 2000; Kim et al. 2000; Tauritz et al. 2000;
Yu and Lam 1998].



3.4 Word sense disambiguation 

Word sense disambiguation (WSD) is the activity of finding, given the occurrence 
in a text of an ambiguous (i.e. polysemous or homonymous) word, the sense of this 
particular word occurrence. For instance, bank may have (at least) two different 
senses in English, as in the Bank of England (a financial institution) or the bank
of river Thames (a hydraulic engineering artifact). It is thus a WSD task to
decide which of the above senses the occurrence of bank in Last week I borrowed
some money from the bank has. WSD is very important for many applications,
including natural language processing, and indexing documents by word senses
rather than by words for IR purposes. WSD may be seen as a TC task (see e.g.
[Gale et al. 1993; Escudero et al. 2000]) once we view word occurrence contexts as
documents and word senses as categories. Quite obviously, this is a single-label TC
case, and one in which document-pivoted TC is usually the right choice.

WSD is just an example of the more general issue of resolving natural lan- 
guage ambiguities, one of the most important problems in computational linguis- 
tics. Other examples, which may all be tackled by means of TC techniques along 
the lines discussed for WSD, are context-sensitive spelling correction, prepositional 
phrase attachment, part of speech tagging, and word choice selection in machine 
translation; see [Roth 1998] for an introduction.



3.5 Hierarchical categorization of Web pages 

TC has recently aroused a lot of interest also for its possible application to auto- 
matically classifying Web pages, or sites, under the hierarchical catalogues hosted 
by popular Internet portals. When Web documents are catalogued in this way, 
rather than issuing a query to a general-purpose Web search engine a searcher may 
find it easier to first navigate in the hierarchy of categories and then restrict her 
search to a particular category of interest. 

Classifying Web pages automatically has obvious advantages, since the manual 
categorization of a large enough subset of the Web is infeasible. Unlike in the 
previous applications, it is typically the case that each category must be populated 
by a set of $k_1 \le x \le k_2$ documents. CPC should be chosen so as to allow new
categories to be added and obsolete ones to be deleted. 

With respect to previously discussed TC applications, automatic Web page cat- 
egorization has two essential peculiarities: 

(1) The hypertextual nature of the documents: links are a rich source of information, 
as they may be understood as stating the relevance of the linked page to the 
linking page. Techniques exploiting this intuition in a TC context have been 
presented in [Attardi et al. 1998; Chakrabarti et al. 1998b; Fürnkranz 1999;
Gövert et al. 1999; Oh et al. 2000] and experimentally compared in [Yang et al.
2001].



(2) The hierarchical structure of the category set: this may be used e.g. by de- 
composing the classification problem into a number of smaller classification 
problems, each corresponding to a branching decision at an internal node. 
Techniques exploiting this intuition in a TC context have been presented in
[Dumais and Chen 2000; Chakrabarti et al. 1998a; Koller and Sahami 1997;
McCallum et al. 1998; Ruiz and Srinivasan 1999; Weigend et al. 1999].
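
The decomposition into branching decisions at internal nodes can be sketched in Python as a top-down traversal of the category tree. This is a minimal sketch under loud assumptions: the `Node` structure, the `mentions` keyword test used as the local decision at each node, and the tiny catalogue are all hypothetical, and none of the cited systems is claimed to work this way.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def mentions(*words: str) -> Callable[[str], bool]:
    """Hypothetical local decision: accept if the document contains any of the words."""
    vocabulary = set(words)
    return lambda document: bool(vocabulary & set(document.lower().split()))


@dataclass
class Node:
    name: str
    accepts: Callable[[str], bool]  # branching decision taken at this node
    children: List["Node"] = field(default_factory=list)


def classify_top_down(document: str, node: Node) -> List[str]:
    """Follow every accepted branch and return the leaf categories reached.
    A document accepted by an internal node but by none of its children yields no leaf."""
    if not node.children:
        return [node.name]
    leaves: List[str] = []
    for child in node.children:
        if child.accepts(document):
            leaves.extend(classify_top_down(document, child))
    return leaves


catalogue = Node("Root", lambda document: True, [
    Node("Arts", mentions("jazz", "concert", "painting"), [
        Node("Music", mentions("jazz", "concert")),
        Node("Painting", mentions("painting")),
    ]),
    Node("Science", mentions("physics", "biology"), [
        Node("Physics", mentions("physics")),
        Node("Biology", mentions("biology")),
    ]),
])

print(classify_top_down("a jazz concert review", catalogue))
```

Each local decision only has to discriminate among the children of one node, so every subproblem involves far fewer categories than the flat problem over all leaves.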



4. THE MACHINE LEARNING APPROACH TO TEXT CATEGORIZATION 

In the '80s the most popular approach (at least in operational settings) for the 
creation of automatic document classifiers consisted in manually building, by means 
of knowledge engineering (KE) techniques, an expert system capable of taking TC 
decisions. Such an expert system would typically consist of a set of manually defined 
logical rules, one per category, of type 

     if ⟨DNF formula⟩ then ⟨category⟩

A DNF ("disjunctive normal form") formula is a disjunction of conjunctive clauses;
the document is classified under ⟨category⟩ iff it satisfies the formula, i.e. iff it
satisfies at least one of the clauses. The most famous example of this approach is
the Construe system [Hayes et al. 1990], built by Carnegie Group for the Reuters
news agency. A sample rule of the type used in Construe is illustrated in Figure 1.

     if ((wheat & farm) or
         (wheat & commodity) or
         (bushels & export) or
         (wheat & tonnes) or
         (wheat & winter & ¬ soft))
     then Wheat else ¬ Wheat

Fig. 1. Rule-based classifier for the Wheat category; keywords are indicated in italic, categories
are indicated in Small Caps (from [Apté et al. 1994]).
The drawback of this approach is the knowledge acquisition bottleneck well-known 
from the expert systems literature. That is, the rules must be manually defined by 
a knowledge engineer with the aid of a domain expert (in this case, an expert in the 
membership of documents in the chosen set of categories): if the set of categories 
is updated, then these two professionals must intervene again, and if the classifier 
is ported to a completely different domain (i.e. set of categories) a different domain 
expert needs to intervene and the work has to be repeated from scratch. 

On the other hand, it was originally suggested that this approach can give very 
good effectiveness results: Hayes et al. [1990] reported a .90 "breakeven" result
(see Section 7) on a subset of the Reuters test collection, a figure that outperforms
even the best classifiers built in the late '90s by state-of-the-art ML techniques.
However, no other classifier has been tested on the same dataset as Construe,
and it is not clear whether this was a randomly chosen or a favourable subset of
the entire Reuters collection. As argued in [Yang 1999], the results above do not
allow us to state that these effectiveness results may be obtained in general. 

Since the early '90s, the ML approach to TC has gained popularity and has
eventually become the dominant one, at least in the research community (see [Mitchell
1996] for a comprehensive introduction to ML). In this approach a general
inductive process (also called the learner) automatically builds a classifier for a category
$c_i$ by observing the characteristics of a set of documents manually classified
under $c_i$ or $\bar{c}_i$ by a domain expert; from these characteristics, the inductive process
gleans the characteristics that a new unseen document should have in order to be
classified under $c_i$. In ML terminology, the classification problem is an activity of
supervised learning, since the learning process is "supervised" by the knowledge of
the categories and of the training instances that belong to them^2.

The advantages of the ML approach over the KE approach are evident. The 
engineering effort goes towards the construction not of a classifier, but of an
automatic builder of classifiers (the learner). This means that if a learner is (as it often
is) available off-the-shelf, all that is needed is the inductive, automatic construction 
of a classifier from a set of manually classified documents. The same happens if 
a classifier already exists and the original set of categories is updated, or if the 
classifier is ported to a completely different domain. 



^2 Within the area of content-based document management tasks, an example of an unsupervised
learning activity is document clustering (see Section 1).
In the ML approach the preclassified documents are then the key resource. In the
most favourable case they are already available; this typically happens for
organizations which have previously carried out the same categorization activity manually
and decide to automate the process. The less favourable case is when no manually
classified documents are available; this typically happens for organizations which
start a categorization activity and opt for an automated modality straightaway. 
The ML approach is more convenient than the KE approach also in this latter 
case. In fact, it is easier to manually classify a set of documents than to build and 
tune a set of rules, since it is easier to characterize a concept extensionally (i.e. to 
select instances of it) than intensionally (i.e. to describe the concept in words, or 
to describe a procedure for recognizing its instances). 

Classifiers built by means of ML techniques nowadays achieve impressive levels
of effectiveness (see Section 7), making automatic classification a qualitatively (and
not only economically) viable alternative to manual classification.

4.1 Training set, test set, and validation set 

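The ML approach relies on the availability of an initial corpus
$\Omega = \{d_1, \ldots, d_{|\Omega|}\} \subset \mathcal{D}$ of documents preclassified under
$\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$. That is, the values of the total function
$\breve{\Phi} : \mathcal{D} \times \mathcal{C} \rightarrow \{T, F\}$ are known for every pair
$\langle d_j, c_i \rangle \in \Omega \times \mathcal{C}$. A document $d_j$ is a positive example of
$c_i$ if $\breve{\Phi}(d_j, c_i) = T$, a negative example of $c_i$ if $\breve{\Phi}(d_j, c_i) = F$.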

In research settings (and in most operational settings too), once a classifier $\Phi$ has
been built it is desirable to evaluate its effectiveness. In this case, prior to classifier
construction the initial corpus is split into two sets, not necessarily of equal size:

- a training(-and-validation) set $TV = \{d_1, \ldots, d_{|TV|}\}$. The classifier $\Phi$ for
categories $\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$ is inductively built by observing the
characteristics of these documents;

- a test set $Te = \{d_{|TV|+1}, \ldots, d_{|\Omega|}\}$, used for testing the effectiveness of the
classifiers. Each $d_j \in Te$ is fed to the classifier, and the classifier decisions
$\Phi(d_j, c_i)$ are compared with the expert decisions $\breve{\Phi}(d_j, c_i)$. A measure of
classification effectiveness is based on how often the $\Phi(d_j, c_i)$ values match the
$\breve{\Phi}(d_j, c_i)$ values.

The documents in $Te$ cannot participate in any way in the inductive construction of
the classifiers; if this condition were not satisfied the experimental results obtained
would likely be unrealistically good, and the evaluation would thus have no scientific
character [Mitchell 1996, page 129]. In an operational setting, after evaluation has
been performed one would typically re-train the classifier on the entire initial corpus,
in order to boost effectiveness. In this case the results of the previous evaluation
would be a pessimistic estimate of the real performance, since the final classifier
has been trained on more data than the classifier evaluated.

This is called the train-and-test approach. An alternative is the k-fold
cross-validation approach (see e.g. [Mitchell 1996, page 146]), in which k different
classifiers $\Phi_1, \ldots, \Phi_k$ are built by partitioning the initial corpus into k disjoint
sets $Te_1, \ldots, Te_k$ and then iteratively applying the train-and-test approach on pairs
$\langle TV_i = \Omega - Te_i, Te_i \rangle$. The final effectiveness figure is obtained by individually
computing the effectiveness of $\Phi_1, \ldots, \Phi_k$, and then averaging the individual
results in some way.
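The k-fold procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch: the names `k_fold_splits`, `cross_validate`, `build` and `evaluate` are hypothetical, the corpus is treated as a plain list of document identifiers, and plain averaging is just one of the possible ways of combining the individual results.

```python
from typing import Callable, List, Sequence, Tuple


def k_fold_splits(corpus: Sequence[str], k: int) -> List[Tuple[List[str], List[str]]]:
    """Partition the corpus into k disjoint test sets Te_i and pair each one
    with its complement TV_i (the corpus minus Te_i)."""
    if not 2 <= k <= len(corpus):
        raise ValueError("k must be between 2 and the corpus size")
    # Every document lands in exactly one fold.
    folds = [list(corpus[i::k]) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        splits.append((train, test))
    return splits


def cross_validate(corpus: Sequence[str], k: int,
                   build: Callable[[List[str]], object],
                   evaluate: Callable[[object, List[str]], float]) -> float:
    """Build one classifier per fold and average the k effectiveness figures.
    `build` and `evaluate` are placeholders for a learner and an effectiveness measure."""
    scores = [evaluate(build(tv), te) for tv, te in k_fold_splits(corpus, k)]
    return sum(scores) / len(scores)
```

Each document is used for testing exactly once and never appears in the set used to build the classifier it is tested on.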

In both approaches it is often the case that the internal parameters of the
classifiers must be tuned, by testing which values of the parameters yield the best
effectiveness. In order to make this optimization possible, in the train-and-test approach
the set $\{d_1, \ldots, d_{|TV|}\}$ is further split into a training set $Tr = \{d_1, \ldots, d_{|Tr|}\}$,
from which the classifier is built, and a validation set $Va = \{d_{|Tr|+1}, \ldots, d_{|TV|}\}$
(sometimes called a hold-out set), on which the repeated tests of the classifier aimed
at parameter optimization are performed; the obvious variant may be used in the
k-fold cross-validation case. Note that, for the same reason why we do not test a
classifier on the documents it has been trained on, we do not test it on the
documents it has been optimized on: test set and validation set must be kept separate.
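The three-way split described above can be illustrated as follows. This is a sketch under stated assumptions: the function names, the split fractions and the random shuffling are hypothetical choices, not something the text prescribes, and `build` and `evaluate` again stand for an arbitrary learner and effectiveness measure.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_corpus(corpus: Sequence[str], test_frac: float = 0.2, val_frac: float = 0.2,
                 seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Split the corpus into disjoint training (Tr), validation (Va) and test (Te) sets."""
    docs = list(corpus)
    random.Random(seed).shuffle(docs)          # assumed: a random assignment of documents
    n_te = int(len(docs) * test_frac)
    n_va = int((len(docs) - n_te) * val_frac)  # Va is carved out of TV, never out of Te
    te = docs[:n_te]
    va = docs[n_te:n_te + n_va]
    tr = docs[n_te + n_va:]
    return tr, va, te


def tune(tr: List[str], va: List[str], candidates: Sequence[float],
         build: Callable[[List[str], float], object],
         evaluate: Callable[[object, List[str]], float]) -> float:
    """Return the parameter value whose classifier, built on Tr, scores best on Va.
    Te is deliberately not an argument, so it cannot leak into the optimization."""
    return max(candidates, key=lambda p: evaluate(build(tr, p), va))
```

Keeping `Te` out of the signature of `tune` is one simple way of enforcing the separation between validation and test documents.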




^From now on, we will take the freedom to use the expression "test document" to denote any
document not in the training set and validation set. This includes thus any document submitted
to the classifier in the operational phase.

Given a corpus $\Omega$, one may define the generality $g_\Omega(c_i)$ of a category $c_i$ as the
percentage of documents that belong to $c_i$, i.e.:

$$g_\Omega(c_i) = \frac{|\{d_j \in \Omega \mid \breve{\Phi}(d_j, c_i) = T\}|}{|\Omega|}$$

The training set generality $g_{Tr}(c_i)$, validation set generality $g_{Va}(c_i)$, and test set
generality $g_{Te}(c_i)$ of $c_i$ may be defined in the obvious way.

4.2 Information retrieval techniques and text categorization

Text categorization heavily relies on the basic machinery of IR. The reason is that
TC is a content-based document management task, and as such it shares many
characteristics with other IR tasks such as text search.

IR techniques are used in three phases of the text classifier life cycle:

(1) IR-style indexing is always performed on the documents of the initial corpus
and on those to be classified during the operational phase;

(2) IR-style techniques (such as document-request matching, query reformulation,
. . . ) are often used in the inductive construction of the classifiers;

(3) IR-style evaluation of the effectiveness of the classifiers is performed.

The various approaches to classification differ mostly in how they tackle (2),
although in a few cases non-standard approaches to (1) and (3) are also used.
Indexing, induction and evaluation are the themes of Sections 5, 6 and 7, respectively.

5. DOCUMENT INDEXING AND DIMENSIONALITY REDUCTION
5.1 Document indexing

Texts cannot be directly interpreted by a classifier or by a classifier-building
algorithm. Because of this, an indexing procedure that maps a text $d_j$ into a compact
representation of its content needs to be uniformly applied to training, validation
and test documents. The choice of a representation for text depends on what one
regards as the meaningful units of text (the problem of lexical semantics) and the
meaningful natural language rules for the combination of these units (the problem
of compositional semantics). Similarly to what happens in IR, in TC this latter
problem is usually disregarded, and a text $d_j$ is usually represented as a vector of
term weights $\vec{d}_j = \langle w_{1j}, \ldots, w_{|\mathcal{T}|j} \rangle$, where $\mathcal{T}$ is the set of terms
(sometimes called features) that occur at least once in at least one document of $Tr$, and
$0 \le w_{kj} \le 1$ represents, loosely speaking, how much term $t_k$ contributes to the semantics
of document $d_j$. Differences among approaches are accounted for by

(1) different ways to understand what a term is; 

(2) different ways to compute term weights. 

A typical choice for (1) is to identify terms with words. This is often called either the
set of words or the bag of words approach to document representation, depending
on whether weights are binary or not.
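The difference between the two representations can be made concrete with a small Python sketch. The whitespace tokenizer and the function names are hypothetical simplifications, and raw occurrence counts are used here as the simplest example of non-binary weights; they are not yet scaled to the [0,1] interval.

```python
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: lowercase the text and split it on whitespace."""
    return text.lower().split()


def bag_of_words(text: str) -> Dict[str, int]:
    """Non-binary representation: each word is mapped to its number of occurrences."""
    return dict(Counter(tokenize(text)))


def set_of_words(text: str) -> Dict[str, int]:
    """Binary representation: each word that occurs is mapped to 1."""
    return {word: 1 for word in set(tokenize(text))}
```

In both cases word order is discarded, which is exactly the sense in which compositional semantics is disregarded.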



In a number of experiments [Apte et al. 1994; Dumais et al. 1998; Lewis 1992a]
it has been found that representations more sophisticated than this do not yield
significantly better effectiveness, thereby confirming similar results from IR [Salton
and Buckley 1988]. In particular, some authors have used phrases, rather than
individual words, as indexing terms [Fuhr et al. 1991; Schütze et al. 1995; Tzeras
and Hartmann 1993], but the experimental results found to date have not been
uniformly encouraging, irrespectively of whether the notion of "phrase" is motivated

- syntactically, i.e. the phrase is such according to a grammar of the language (see
e.g. [Lewis 1992a]);

- statistically, i.e. the phrase is not grammatically such, but is composed of a
set/sequence of words whose patterns of contiguous occurrence in the collection
are statistically significant (see e.g. [Caropreso et al. 2001]).



Lewis [1992a] argues that the likely reason for the discouraging results is that,
although indexing languages based on phrases have superior semantic qualities,
they have inferior statistical qualities with respect to word-only indexing languages:
a phrase-only indexing language has "more terms, more synonymous or nearly
synonymous terms, lower consistency of assignment (since synonymous terms are
not assigned to the same documents), and lower document frequency for terms"
[Lewis 1992a, page 40]. Although his remarks are about syntactically motivated
phrases, they also apply to statistically motivated ones, although perhaps to a
smaller degree. A combination of the two approaches is probably the best way to
go: Tzeras and Hartmann [1993] obtained significant improvements by using noun
phrases obtained through a combination of syntactic and statistical criteria, where
a "crude" syntactic method was complemented by a statistical filter (only those
syntactic phrases that occurred at least three times in the positive examples of
a category $c_i$ were retained). It is likely that the final word on the usefulness of
phrase indexing in TC has yet to be said, and investigations in this direction are
still being actively pursued [Caropreso et al. 2001; Mladenic and Grobelnik 1998].
As for issue (2), weights usually range between 0 and 1 (an exception is [Lewis
et al. 1996]), and for ease of exposition we will assume they always do. As a special
case, binary weights may be used (1 denoting presence and 0 absence of the term
in the document); whether binary or non-binary weights are used depends on the
classifier learning algorithm used. In the case of non-binary indexing, for
determining the weight $w_{kj}$ of term $t_k$ in document $d_j$ any IR-style indexing technique
that represents a document as a vector of weighted terms may be used. Most of
the time, the standard tfidf function is used (see e.g. [Salton and Buckley 1988]),
defined as

^An exception to this is represented by learning approaches based on Hidden Markov Models
[Denoyer et al. 2001; Frasconi et al. 2001].



$$tfidf(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{\#_{Tr}(t_k)} \qquad (1)$$



where $\#(t_k, d_j)$ denotes the number of times $t_k$ occurs in $d_j$, and $\#_{Tr}(t_k)$ denotes
the document frequency of term $t_k$, i.e. the number of documents in $Tr$ in which
$t_k$ occurs. This function embodies the intuitions that (i) the more often a term
occurs in a document, the more it is representative of its content, and (ii) the more
documents a term occurs in, the less discriminating it is. Note that this formula
(as most other indexing formulae) weights the importance of a term to a document
in terms of occurrence considerations only, thereby deeming of null importance
the order in which the terms occur in the document and the syntactic role they
play. In other words, the semantics of a document is reduced to the collective
lexical semantics of the terms that occur in it, thereby disregarding the issue of
compositional semantics (exceptions are the representation techniques used for
Foil [Cohen 1995a] and Sleeping Experts [Cohen and Singer 1999]).

In order for the weights to fall in the [0,1] interval and for the documents to be 
represented by vectors of equal length, the weights resulting from tfidf are often 
normalized by cosine normalization, given by: 

$$w_{kj} = \frac{tfidf(t_k, d_j)}{\sqrt{\sum_{s=1}^{|\mathcal{T}|} (tfidf(t_s, d_j))^2}} \qquad (2)$$



Although normalized tfidf is the most popular one, other indexing functions have
also been used, including probabilistic techniques [Gövert et al. 1999] or techniques
for indexing structured documents [Larkey and Croft 1996]. Functions different
from tfidf are especially needed when $Tr$ is not available in its entirety from the
start and $\#_{Tr}(t_k)$ cannot thus be computed, as e.g. in adaptive filtering; in this
case approximations of tfidf are usually employed [Dagan et al. 1997, Section 4.3].
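As an illustration of the tfidf weighting and of its cosine normalization, the following Python sketch computes normalized weights for a training set given as lists of tokens. The function name and the data layout are assumptions made for the example; the natural logarithm is used, and the choice of base does not affect the normalized weights.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(train_docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one cosine-normalized tfidf vector (term -> weight) per training document."""
    n_docs = len(train_docs)
    # Document frequency: number of training documents in which each term occurs.
    df = Counter(term for doc in train_docs for term in set(doc))
    vectors = []
    for doc in train_docs:
        tf = Counter(doc)  # number of occurrences of each term in this document
        raw = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # A document whose terms all occur in every training document has norm 0.
        vectors.append({t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()})
    return vectors
```

A term that occurs in every training document receives weight 0, since the logarithm of 1 is 0; this is intuition (ii) above taken to its extreme.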

Before indexing, the removal of function words (i.e. topic-neutral words such as
articles, prepositions, conjunctions, etc.) is almost always performed (exceptions
include [Lewis et al. 1996; Nigam et al. 2000; Riloff 1995]). Concerning stemming
(i.e. grouping words that share the same morphological root), its suitability to TC
is controversial. Although, similarly to unsupervised term clustering (see Section
5.5.1) of which it is an instance, stemming has sometimes been reported to hurt
effectiveness (e.g. [Baker and McCallum 1998]), the recent tendency is to adopt it,
as it reduces both the dimensionality of the term space (see Section 5.3) and the
stochastic dependence between terms (see Section 3.2).

^There exist many variants of tfidf, that differ from each other in terms of logarithms,
normalization or other correction factors. Formula 1 is just one of the possible instances of this
class; see [Salton and Buckley 1988; Singhal et al. 1996] for variations on this theme.

^One application of TC in which it would be inappropriate to remove function words is author
identification for documents of disputed paternity. In fact, as noted in [Manning and Schütze
1999, page 589], "it is often the 'little' words that give an author away (for example, the relative
frequencies of words like because or though)".

Depending on the application, either the full text of the document or selected
parts of it are indexed. While the former option is the rule, exceptions exist. For
instance, in a patent categorization application Larkey [1999] indexes only the title,
the abstract, the first 20 lines of the summary, and the section containing the claims
of novelty of the described invention. This approach is made possible by the fact
that documents describing patents are structured. Similarly, when a document title
is available, one can give extra importance to the words it contains [Apte et al. 1994;
Cohen and Singer 1999; Weiss et al. 1999]. When documents are flat, identifying
the most relevant part of a document is instead a non-obvious task.

5.2 The Darmstadt Indexing Approach 



The AIR/X system [Fuhr et al. 1991] occupies a special place in the literature on
indexing for TC. This system is the final result of the AIR project, one of the most
important efforts in the history of TC: spanning a duration of more than ten years
[Knorz 1982; Tzeras and Hartmann 1993], it has produced a system operatively
employed since 1985 in the classification of corpora of scientific literature of $O(10^5)$
documents and $O(10^4)$ categories, and has had important theoretical spin-offs in
the field of probabilistic indexing [Fuhr 1989; Fuhr and Buckley 1991].

The approach to indexing taken in AIR/X is known as the Darmstadt Indexing
Approach (DIA) [Fuhr 1985]. Here, "indexing" is used in the sense of Section 3.1,
i.e. as using terms from a controlled vocabulary, and is thus a synonym of
TC (the DIA was later extended to indexing with free terms [Fuhr and Buckley
1991]). The idea that underlies the DIA is the use of a much wider set of "features"
than described in Section 5.1. All other approaches mentioned in this paper view
terms as the dimensions of the learning space, where terms may be single words,
stems, phrases, or (see Sections 5.5.1 and 5.5.2) combinations of any of these. In
contrast, the DIA considers properties (of terms, documents, categories, or pairwise
relationships among these) as basic dimensions of the learning space. Examples of
these are

- properties of a term $t_k$: e.g. the idf of $t_k$;

- properties of the relationship between a term $t_k$ and a document $d_j$: e.g. the tf
of $t_k$ in $d_j$; or the location (e.g. in the title, or in the abstract) of $t_k$ within $d_j$;

- properties of a document $d_j$: e.g. the length of $d_j$;

- properties of a category $c_i$: e.g. the training set generality of $c_i$.

For each possible document-category pair, the values of these features are collected
in a so-called relevance description vector $\vec{rd}(d_j, c_i)$. The size of this vector is
determined by the number of properties considered, and is thus independent of specific
terms, categories or documents (for multivalued features, appropriate aggregation
functions are applied in order to yield a single value to be included in $\vec{rd}(d_j, c_i)$); in
this way an abstraction from specific terms, categories or documents is achieved.

^The AIR/X system, its applications (including the AIR/PHYS system [Biebricher et al.],
an application of AIR/X to indexing physics literature), and its experiments, have also been richly
documented in a series of papers and doctoral theses written in German. The interested reader
may consult [Fuhr et al. 1991] for a detailed bibliography.

The main advantage of this approach is the possibility to consider additional
features that can hardly be accounted for in the usual term-based approaches, e.g.
the location of a term within a document, or the certainty with which a phrase was
identified in a document. The term-category relationship is described by estimates,
derived from the training set, of the probability $P(c_i \mid t_k)$ that a document belongs to
category $c_i$, given that it contains term $t_k$ (the DIA association factor). Relevance
description vectors $\vec{rd}(d_j, c_i)$ are then the final representations that are used for the
classification of document $d_j$ under category $c_i$.
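To make the notion of a relevance description more tangible, here is a deliberately simplified Python sketch. It is not the AIR/X implementation: the four features, the aggregation functions (maximum and mean) and all names are hypothetical choices made for illustration. What it does show is that the vector has a fixed size that does not depend on the vocabulary, the document or the category.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical training set layout: (tokens of the document, categories it belongs to).
TrainingSet = Sequence[Tuple[List[str], Set[str]]]


def association_factor(term: str, category: str, train: TrainingSet) -> float:
    """Estimate P(c_i | t_k): fraction of training documents containing the term
    that belong to the category (0.0 if the term never occurs)."""
    with_term = [cats for tokens, cats in train if term in tokens]
    if not with_term:
        return 0.0
    return sum(category in cats for cats in with_term) / len(with_term)


def relevance_description(tokens: List[str], category: str, train: TrainingSet) -> List[float]:
    """Build a fixed-size feature vector for one document-category pair."""
    factors = [association_factor(t, category, train) for t in set(tokens)]
    generality = sum(category in cats for _, cats in train) / len(train)
    return [
        max(factors, default=0.0),                          # strongest term-category association
        sum(factors) / len(factors) if factors else 0.0,    # mean association over the terms
        float(len(tokens)),                                 # a property of the document: its length
        generality,                                         # a property of the category
    ]
```

The per-term association factors are multivalued, so they are collapsed by aggregation functions before entering the vector, as the text describes.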

The essential ideas of the DIA - transforming the classification space by means 
of abstraction and using a more detailed text representation than the standard 
bag-of-words approach - have not been taken up by other researchers so far. For 
new TC applications dealing with structured documents or categorization of Web 
pages, these ideas may become of increasing importance. 

5.3 Dimensionality reduction 

Unlike in text retrieval, in TC the high dimensionality of the term space (i.e. the
large value of $|\mathcal{T}|$) may be problematic. In fact, while typical algorithms used in text
retrieval (such as cosine matching) can scale to high values of $|\mathcal{T}|$, the same does
not hold of many sophisticated learning algorithms used for classifier induction (e.g.
the LLSF algorithm of [Yang and Chute 1994]). Because of this, before classifier
induction one often applies a pass of dimensionality reduction (DR), whose effect
is to reduce the size of the vector space from $|\mathcal{T}|$ to $|\mathcal{T}'| \ll |\mathcal{T}|$; the set $\mathcal{T}'$ is
called the reduced term set.

DR is also beneficial since it tends to reduce overfitting, i.e. the phenomenon
by which a classifier is tuned also to the contingent characteristics of the training
data rather than just the constitutive characteristics of the categories. Classifiers
which overfit the training data are good at re-classifying the data they have been
trained on, but much worse at classifying previously unseen data. Experiments
have shown that in order to avoid overfitting a number of training examples roughly
proportional to the number of terms used is needed; Fuhr and Buckley [1991, page
235] have suggested that 50-100 training examples per term may be needed in TC
tasks. This means that if DR is performed, overfitting may be avoided even if a
smaller number of training examples is used. However, in removing terms the risk
is to remove potentially useful information on the meaning of the documents. It is
then clear that, in order to obtain optimal (cost-)effectiveness, the reduction process
must be performed with care. Various DR methods have been proposed, either from
the information theory or from the linear algebra literature, and their relative merits
have been tested by experimentally evaluating the variation in effectiveness that a
given classifier undergoes after application of the function to the term space.

There are two distinct ways of viewing DR, depending on whether the task is 
performed locally (i.e. for each individual category) or globally: 



^Association factors are called adhesion coefficients in many early papers on TC; see e.g. [Field
1975; Robertson and Harding 1984].

- local DR: for each category $c_i$, a set $\mathcal{T}'_i$ of terms, with $|\mathcal{T}'_i| \ll |\mathcal{T}|$, is chosen for
classification under $c_i$ (see [Apte et al. 1994; Lewis and Ringuette 1994; Sable and
Hatzivassiloglou 2000; Schütze et al. 1995; Wiener et al. 1995]). This means that
different subsets of $\vec{d}_j$ are used when working with the different categories. Typical
values are $10 \le |\mathcal{T}'_i| \le 50$.

- global DR: a set $\mathcal{T}'$ of terms, with $|\mathcal{T}'| \ll |\mathcal{T}|$, is chosen for the classification
under all categories $\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$ (see e.g. [Caropreso et al. 2001; Mladenic
1998; Yang 1999; Yang and Pedersen 1997]).

This distinction usually does not impact on the choice of DR technique, since most
such techniques can be used (and have been used) for local and global DR alike
(supervised DR techniques - see Section 5.5.1 - are exceptions to this rule). In
the rest of this section we will assume that the global approach is used, although
everything we will say also applies to the local approach.

A second, orthogonal distinction may be drawn in terms of the nature of the 
resulting terms: 

- DR by term selection: $\mathcal{T}'$ is a subset of $\mathcal{T}$;

- DR by term extraction: the terms in $\mathcal{T}'$ are not of the same type as the terms in
$\mathcal{T}$ (e.g. if the terms in $\mathcal{T}$ are words, the terms in $\mathcal{T}'$ may not be words at all),
but are obtained by combinations or transformations of the original ones.

Unlike in the previous distinction, these two ways of doing DR are tackled by very
different techniques; we will address them separately in the next two sections.

5.4 Dimensionality reduction by term selection 

Given a predetermined integer r, techniques for term selection (also called term
space reduction - TSR) attempt to select, from the original set $\mathcal{T}$, the set $\mathcal{T}'$ of
terms (with $|\mathcal{T}'| \ll |\mathcal{T}|$) that, when used for document indexing, yields the highest
effectiveness. Yang and Pedersen [1997] have shown that TSR may even result in
a moderate (< 5%) increase in effectiveness, depending on the classifier, on the
aggressivity $\frac{|\mathcal{T}|}{|\mathcal{T}'|}$ of the reduction, and on the TSR technique used.

Moulinier et al. [1996] have used a so-called wrapper approach, i.e. one in which $\mathcal{T}'$
is identified by means of the same learning method which will be used for building
the classifier [John et al. 1994]. Starting from an initial term set, a new term
set is generated by either adding or removing a term. When a new term set is
generated, a classifier based on it is built and then tested on a validation set. The
term set that results in the best effectiveness is chosen. This approach has the
advantage of being tuned to the learning algorithm being used; moreover, if local
DR is performed, different numbers of terms for different categories may be chosen,
depending on whether a category is or is not easily separable from the others.
However, the sheer size of the space of different term sets makes its cost prohibitive
for standard TC applications.
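The add-or-remove search described above can be sketched as a greedy hill-climbing loop. This is an assumption-laden illustration rather than the procedure of any cited paper: the greedy strategy, the empty starting set and the names are hypothetical, and `score` stands for "build a classifier on this term set and measure its effectiveness on the validation set".

```python
from typing import Callable, FrozenSet, Iterable, Set


def wrapper_selection(terms: Iterable[str],
                      score: Callable[[FrozenSet[str]], float],
                      max_steps: int = 100) -> Set[str]:
    """Greedy search over term sets: at each step try adding or removing one term,
    keep the best-scoring neighbour, and stop when no neighbour improves the score."""
    all_terms = set(terms)
    current: FrozenSet[str] = frozenset()   # assumed starting point: the empty term set
    best = score(current)
    for _ in range(max_steps):
        neighbours = [current | {t} for t in all_terms - current]
        neighbours += [current - {t} for t in current]
        if not neighbours:
            break
        candidate = max(neighbours, key=score)
        candidate_score = score(candidate)
        if candidate_score <= best:
            break                            # local optimum reached
        current, best = candidate, candidate_score
    return set(current)
```

Every step calls `score` once per neighbouring term set, i.e. it trains and validates about as many classifiers as there are terms, which makes the cost noted in the text easy to see.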



A computationally easier alternative is the filtering approach [John et al. 1994],
i.e. keeping the $|\mathcal{T}'| \ll |\mathcal{T}|$ terms that receive the highest score according to a
function that measures the "importance" of the term for the TC task. We will
explore this solution in the rest of this section.

5.4.1 Document frequency. A simple and effective global TSR function is the
document frequency $\#_{Tr}(t_k)$ of a term $t_k$, i.e. only the terms that occur in the highest
number of documents are retained. In a series of experiments Yang and Pedersen
[1997] have shown that with $\#_{Tr}(t_k)$ it is possible to reduce the dimensionality by
a factor of 10 with no loss in effectiveness (a reduction by a factor of 100 bringing
about just a small loss).

This seems to indicate that the terms occurring most frequently in the collection
are the most valuable for TC. As such, this would seem to contradict a well-known
"law" of IR, according to which the terms with low-to-medium document frequency
are the most informative ones [Salton and Buckley 1988]. But these two results do
not contradict each other, since it is well-known (see e.g. [Salton et al. 1975]) that
the large majority of the words occurring in a corpus have a very low document
frequency; this means that by reducing the term set by a factor of 10 using document
frequency, only such words are removed, while the words from low-to-medium to
high document frequency are preserved. Of course, stop words need to be removed
in advance, lest only topic-neutral words are retained [Mladenic 1998].

Finally, note that a slightly more empirical form of TSR by document frequency
is adopted by many authors, who remove all terms occurring in at most x training
documents (popular values for x range from 1 to 3), either as the only form of DR
[Maron 1961; Ittner et al. 1995] or before applying another more sophisticated form
[Dumais et al. 1998; Li and Jain 1998]. A variant of this policy is removing all terms
that occur at most x times in the training set (e.g. [Dagan et al. 1997; Joachims
1997]), with popular values for x ranging from 1 (e.g. [Baker and McCallum 1998])
to 5 (e.g. [Apte et al. 1994; Cohen 1995a]).
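Both forms of TSR by document frequency discussed in this section are straightforward to express in code. The following Python sketch uses hypothetical function names and assumes that documents are lists of tokens from which stop words have already been removed.

```python
from collections import Counter
from typing import List, Set


def document_frequency(train_docs: List[List[str]]) -> Counter:
    """Number of training documents in which each term occurs."""
    return Counter(term for doc in train_docs for term in set(doc))


def select_top_df(train_docs: List[List[str]], n_terms: int) -> Set[str]:
    """Keep the n_terms terms with the highest document frequency."""
    df = document_frequency(train_docs)
    return {term for term, _ in df.most_common(n_terms)}


def drop_rare(train_docs: List[List[str]], x: int) -> Set[str]:
    """Keep only the terms that occur in more than x training documents."""
    df = document_frequency(train_docs)
    return {term for term, count in df.items() if count > x}
```

The first function fixes the size of the reduced term set in advance, while the second fixes the threshold and lets the size follow from the data.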



5.4.2 Other information-theoretic term selection functions. Other more sophisticated
information-theoretic functions have been used in the literature, among
which the DIA association factor [Fuhr et al. 1991], chi-square [Caropreso et al.
2001; Galavotti et al. 2000; Schütze et al. 1995; Sebastiani et al. 2000; Yang and
Pedersen 1997; Yang and Liu 1999], NGL coefficient [Ng et al. 1997; Ruiz and
Srinivasan 1999], information gain [Caropreso et al. 2001; Larkey 1998; Lewis 1992a;
Lewis and Ringuette 1994; Mladenic 1998; Moulinier and Ganascia 1996; Yang and
Pedersen 1997; Yang and Liu 1999], mutual information [Dumais et al. 1998; Lam
et al. 1997; Larkey and Croft 1996; Lewis and Ringuette 1994; Li and Jain 1998;
Moulinier et al. 1996; Ruiz and Srinivasan 1999; Taira and Haruno 1999; Yang and
Pedersen 1997], odds ratio [Caropreso et al. 2001; Mladenic 1998; Ruiz and
Srinivasan 1999], relevancy score [Wiener et al. 1995], and GSS coefficient [Galavotti
et al. 2000]. The mathematical definitions of these measures are summarized for
convenience in Table 1. Here, probabilities are interpreted on an event space of
documents (e.g. $P(\bar{t}_k, c_i)$ denotes the probability that, for a random document x,
term $t_k$ does not occur in x and x belongs to category $c_i$), and are estimated by
counting occurrences in the training set. All functions are specified "locally" to a
specific category $c_i$; in order to assess the value of a term $t_k$ in a "global",
category-independent sense, either the sum $f_{sum}(t_k) = \sum_{i=1}^{|\mathcal{C}|} f(t_k, c_i)$, or the weighted
average $f_{wavg}(t_k) = \sum_{i=1}^{|\mathcal{C}|} P(c_i) f(t_k, c_i)$, or the maximum
$f_{max}(t_k) = \max_{i=1}^{|\mathcal{C}|} f(t_k, c_i)$ of their category-specific values $f(t_k, c_i)$ are usually computed.

| Function | Denoted by | Mathematical form |
| Document frequency | $\#(t_k, c_i)$ | $P(t_k \mid c_i)$ |
| DIA association factor | $z(t_k, c_i)$ | $P(c_i \mid t_k)$ |
| Information gain | $IG(t_k, c_i)$ | $\sum_{c \in \{c_i, \bar{c}_i\}} \sum_{t \in \{t_k, \bar{t}_k\}} P(t, c) \cdot \log \frac{P(t, c)}{P(t) \cdot P(c)}$ |
| Mutual information | $MI(t_k, c_i)$ | $\log \frac{P(t_k, c_i)}{P(t_k) \cdot P(c_i)}$ |
| Chi-square | $\chi^2(t_k, c_i)$ | $\frac{|Tr| \cdot [P(t_k, c_i) \cdot P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) \cdot P(\bar{t}_k, c_i)]^2}{P(t_k) \cdot P(\bar{t}_k) \cdot P(c_i) \cdot P(\bar{c}_i)}$ |
| NGL coefficient | $NGL(t_k, c_i)$ | $\frac{\sqrt{|Tr|} \cdot [P(t_k, c_i) \cdot P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) \cdot P(\bar{t}_k, c_i)]}{\sqrt{P(t_k) \cdot P(\bar{t}_k) \cdot P(c_i) \cdot P(\bar{c}_i)}}$ |
| Relevancy score | $RS(t_k, c_i)$ | $\log \frac{P(t_k \mid c_i) + d}{P(\bar{t}_k \mid \bar{c}_i) + d}$ |
| Odds ratio | $OR(t_k, c_i)$ | $\frac{P(t_k \mid c_i) \cdot (1 - P(t_k \mid \bar{c}_i))}{(1 - P(t_k \mid c_i)) \cdot P(t_k \mid \bar{c}_i)}$ |
| GSS coefficient | $GSS(t_k, c_i)$ | $P(t_k, c_i) \cdot P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) \cdot P(\bar{t}_k, c_i)$ |

Table 1. Main functions used for term space reduction purposes. Information gain is also known
as expected mutual information; it is used under this name by Lewis [1992a, page 44] and Larkey
[1998]. In the $RS(t_k, c_i)$ formula d is a constant damping factor.

^For better uniformity Table 1 views all the TSR functions in terms of subjective probability. In
some cases such as $\#(t_k, c_i)$ and $\chi^2(t_k, c_i)$ this is slightly artificial, since these two functions are
not usually viewed in probabilistic terms. The formulae refer to the "local" (i.e. category-specific)
forms of the functions, which again is slightly artificial in some cases (e.g. $\#(t_k, c_i)$). Note that
the NGL and GSS coefficients are here named after their authors, since they had originally been
given names that might generate some confusion if used here.
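The three "globalization" policies just mentioned amount to a few lines of Python. The sketch below assumes a hypothetical local scoring function `f(term, category)` and a dictionary `prior` holding estimates of the category probabilities; both names are illustrative.

```python
from typing import Callable, Dict, Sequence


def globalize(f: Callable[[str, str], float], term: str,
              categories: Sequence[str], prior: Dict[str, float]) -> Dict[str, float]:
    """Combine the category-specific values of a TSR function into global scores."""
    values = {c: f(term, c) for c in categories}
    return {
        "sum": sum(values.values()),                             # sum over categories
        "wavg": sum(prior[c] * v for c, v in values.items()),    # average weighted by P(c_i)
        "max": max(values.values()),                             # maximum over categories
    }
```

The maximum rewards a term that is highly indicative of a single category, whereas the weighted average favours terms that are useful for the most frequent categories.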

These functions try to capture the intuition that the best terms for $c_i$ are the
ones distributed most differently in the sets of positive and negative examples of
$c_i$. However, interpretations of this principle vary across different functions. For
instance, in the experimental sciences $\chi^2$ is used to measure how the results of an
observation differ (i.e. are independent) from the results expected according to an
initial hypothesis (lower values indicate lower dependence). In DR we measure how
independent $t_k$ and $c_i$ are. The terms $t_k$ with the lowest value for $\chi^2(t_k, c_i)$ are
thus the most independent from $c_i$; since we are interested in the terms which are
not, we select the terms for which $\chi^2(t_k, c_i)$ is highest.
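As a worked illustration, the chi-square and GSS scores of a term-category pair can be computed from four training-set counts. The function name and argument layout below are hypothetical; the probabilities are estimated by relative frequencies, and the sketch assumes that the term occurs in some but not all training documents, and likewise for the category, so that the denominator is non-zero.

```python
from typing import Dict


def chi_square_and_gss(n_tc: int, n_t: int, n_c: int, n: int) -> Dict[str, float]:
    """n_tc: training documents that contain the term and belong to the category;
    n_t: documents that contain the term; n_c: documents in the category; n: |Tr|."""
    p_t_c = n_tc / n                       # term present, document in the category
    p_t_nc = (n_t - n_tc) / n              # term present, document not in the category
    p_nt_c = (n_c - n_tc) / n              # term absent, document in the category
    p_nt_nc = (n - n_t - n_c + n_tc) / n   # term absent, document not in the category
    p_t, p_c = n_t / n, n_c / n
    gss = p_t_c * p_nt_nc - p_t_nc * p_nt_c
    chi2 = n * gss ** 2 / (p_t * (1 - p_t) * p_c * (1 - p_c))
    return {"gss": gss, "chi2": chi2}
```

With 100 training documents, a term occurring in 20 of them and a category containing 30, the score is approximately 0 if 6 documents have both (which is what independence predicts) and about 58.3 if all 20 documents containing the term belong to the category.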

While each TSR function has its own rationale, the ultimate word on its value
is the effectiveness it brings about. Various experimental comparisons of TSR
functions have thus been carried out [Caropreso et al. 2001; Galavotti et al. 2000;
Mladenic 1998; Yang and Pedersen 1997]. In these experiments most functions
listed in Table 1 (with the possible exception of MI) have improved on the results
of document frequency. For instance, Yang and Pedersen [1997] have shown that,
with various classifiers and various initial corpora, sophisticated techniques such
as $IG_{sum}(t_k, c_i)$ or $\chi^2_{max}(t_k, c_i)$ can reduce the dimensionality of the term space
by a factor of 100 with no loss (or even with a small increase) of effectiveness.
Collectively, the experiments reported in the above-mentioned papers seem to
indicate that $\{OR_{sum}, NGL_{sum}, GSS_{max}\} > \{\chi^2_{max}, IG_{sum}\} > \{\#_{wavg}, \chi^2_{wavg}\} >
\{MI_{max}, MI_{wavg}\}$, where ">" means "performs better than". However, it should
be noted that these results are just indicative, and that more general statements
on the relative merits of these functions could be made only as a result of
comparative experiments performed in thoroughly controlled conditions and on a variety
of different situations (e.g. different classifiers, different initial corpora, . . . ).

5.5 Dimensionality reduction by term extraction 

Given a predetermined $|\mathcal{T}'| \ll |\mathcal{T}|$, term extraction attempts to generate, from
the original set $\mathcal{T}$, a set $\mathcal{T}'$ of "synthetic" terms that maximize effectiveness. The
rationale for using synthetic (rather than naturally occurring) terms is that, due to
the pervasive problems of polysemy, homonymy and synonymy, the original terms
may not be optimal dimensions for document content representation. Methods
for term extraction try to solve these problems by creating artificial terms that
do not suffer from them. Any term extraction method consists in (i) a method
for extracting the new terms from the old ones, and (ii) a method for converting
the original document representations into new representations based on the newly
synthesized dimensions. Two term extraction methods have been experimented with in
TC, namely term clustering and latent semantic indexing.

5.5.1 Term clustering. Term clustering tries to group words with a high degree
of pairwise semantic relatedness, so that the groups (or their centroids, or a
representative of them) may be used instead of the terms as dimensions of the vector
space. Term clustering is different from term selection, since the former tends to
address terms synonymous (or near-synonymous) with other terms, while the latter
targets non-informative terms.



Lewis [1992a I was the first to investigate the use of term clustering in TC. The 
method he employed, called reciprocal nearest neighbour clustering, consists in cre- 
ating clusters of two terms that are one the most similar to the other accord- 
ing to some measure of similarity. His results were inferior to those obtained by 
single- word indexing, possibly due to a disappointing performance by the clustering 



method: as Lewis [1992a, page 48] says, "The relationships captured in the clusters 
are mostly accidental, rather than the systematic relationships that were hoped 
for." 



Li and Jain [1998[ view semantic relatedness between words in terms of their 
co-occurrence and co-absence within training documents. By using this technique 
in the context of a hierarchical clustering algorithm they witnessed only a marginal 
effectiveness improvement; however, the small size of their experiment (see Section 



5.11) hardly allows any definitive conclusion to be reached. 



Both [Lewis 1992a; Li and Jain 199S] are examples of unsupervised clustering. 



since clustering is not affected by the category labels attached to the documents. 



Baker and McCallum [1998| provide instead an example of supervised clustering, as 



-^Some term selection methods, such as wrapper methods, also address the problem of redundancy. 
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the distributional clustering method they employ clusters together those terms that 
tend to indicate the presence of the same category, or group of categories. Their 
experiments, carried out in the context of a Naive Bayes classifier (see Section 



6.2), showed only a 2% effectiveness loss with an aggressivity of 1000, and even 



showed some effectiveness improvement with less aggressive levels of reduction. 



Later experiments by Slonim and Tishby |2001] have confirmed the potential of 



supervised clustering methods for term extraction. 
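
The reciprocal nearest neighbour idea described above can be illustrated with a minimal
sketch. The similarity measure (Jaccard overlap of the sets of documents in which two
terms occur), the toy data and all function names are assumptions made for illustration;
they are not taken from Lewis [1992a].

```python
def reciprocal_nearest_neighbours(terms, similarity):
    """Return pairs of terms that are each the most similar term to the other."""
    nearest = {}
    for t in terms:
        others = [u for u in terms if u != t]
        if others:
            nearest[t] = max(others, key=lambda u: similarity(t, u))
    # Keep a pair only when the nearest-neighbour relation holds in both directions.
    return [(t, u) for t, u in nearest.items() if nearest.get(u) == t and t < u]


# Hypothetical toy data: for each term, the ids of the documents it occurs in.
occurrences = {
    "car": {1, 2, 3},
    "automobile": {1, 2, 3, 4},
    "wheat": {5, 6},
    "grain": {5, 6, 7},
    "bank": {2, 5, 8},
}


def jaccard(t, u):
    """Assumed similarity measure: shared documents over documents with either term."""
    a, b = occurrences[t], occurrences[u]
    return len(a & b) / len(a | b)


pairs = reciprocal_nearest_neighbours(sorted(occurrences), jaccard)
```

In this toy run "bank" stays unclustered: its nearest neighbour is "wheat", but the
nearest neighbour of "wheat" is "grain".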



5.5.2 Latent semantic indexing. Latent semantic indexing (LSI - [Deerwester
et al. 1990]) is a DR technique developed in IR in order to address the problems
deriving from the use of synonymous, near-synonymous and polysemous words
as dimensions of document representations. This technique compresses document
vectors into vectors of a lower-dimensional space whose dimensions are obtained
as combinations of the original dimensions by looking at their patterns of
co-occurrence. In practice, LSI infers the dependence among the original terms from
a corpus and "wires" this dependence into the newly obtained, independent
dimensions. The function mapping original vectors into new vectors is obtained by
applying a singular value decomposition to the matrix formed by the original
document vectors. In TC this technique is applied by deriving the mapping function
from the training set and then applying it to training and test documents alike.
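
As a hedged sketch of the mapping just described, the usual truncated-SVD formulation
of LSI can be written as follows. The symbols $X$ (the $|T| \times |Tr|$ term-by-document
matrix built from the training set), $k$ (the number of retained dimensions) and
$\hat{d}$ (the reduced vector) are notation assumed here, not taken from the text.

```latex
% Singular value decomposition of the training matrix, truncated to the
% k largest singular values:
X = U \Sigma V^{\top}, \qquad X \approx X_k = U_k \Sigma_k V_k^{\top}

% Mapping of any (training or test) document vector into the k-dimensional space:
\hat{d} = \Sigma_k^{-1} U_k^{\top} \vec{d}
```

Each new dimension is thus a linear combination of the original $|T|$ term dimensions,
with coefficients given by a column of $U_k$.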

One characteristic of LSI is that the newly obtained dimensions are not, unlike in
term selection and term clustering, intuitively interpretable. However, they work
well in bringing out the "latent" semantic structure of the vocabulary used in the
corpus. For instance, Schütze et al. [1995, page 235] discuss the classification under
category Demographic shifts in the U.S. with economic impact of a document
that was indeed a positive test instance for the category, and that contained,
among others, the quite revealing sentence "The nation grew to 249.6 million
people in the 1980s as more Americans left the industrial and agricultural
heartlands for the South and West". The classifier decision was incorrect
when local DR had been performed by $\chi^2$-based term selection retaining the
top original 200 terms, but was correct when the same task was tackled by means
of LSI. This well exemplifies how LSI works: the above sentence does not contain
any of the 200 terms most relevant to the category selected by $\chi^2$, but quite
possibly the words contained in it have concurred to produce one or more of the LSI
higher-order terms that generate the document space of the category. As Schütze
et al. [1995, page 230] put it, "if there is a great number of terms which all
contribute a small amount of critical information, then the combination of evidence is
a major problem for a term-based classifier". A drawback of LSI, though, is that if
some original term is particularly good in itself at discriminating a category, that
discrimination power may be lost in the new vector space.



Wiener et al. [1995] use LSI in two alternative ways: (i) for local DR, thus creating
several category-specific LSI representations, and (ii) for global DR, thus creating a
single LSI representation for the entire category set. Their experiments showed the
former approach to perform better than the latter, and both approaches to perform
better than simple TSR based on Relevancy Score (see Table 1).



Schütze et al. [1995] experimentally compared LSI-based term extraction with
$\chi^2$-based TSR using three different classifier learning techniques (namely, linear
discriminant analysis, logistic regression and neural networks). Their experiments
showed LSI to be far more effective than $\chi^2$ for the first two techniques, while both
methods performed equally well for the neural network classifier.

For other TC works that use LSI or similar term extraction techniques see e.g.
[Hull 1994; Li and Jain 1998; Schütze 1998; Weigend et al. 1999; Yang 1995].



6. INDUCTIVE CONSTRUCTION OF TEXT CLASSIFIERS 

The inductive construction of text classifiers has been tackled in a variety of ways. 
Here we will deal only with the methods that have been most popular in TC, but 
we will also briefly mention the existence of alternative, less standard approaches. 

We start by discussing the general form that a text classifier has. Let us recall
from Section 2.4 that there are two alternative ways of viewing classification: "hard"
(fully automated) classification and ranking (semi-automated) classification.

The inductive construction of a ranking classifier for category $c_i \in C$ usually
consists in the definition of a function $CSV_i : D \rightarrow [0, 1]$ that, given a document
$d_j$, returns a categorization status value for it, i.e. a number between 0 and 1 that,
roughly speaking, represents the evidence for the fact that $d_j \in c_i$. Documents
are then ranked according to their $CSV_i$ value. This works for "document-ranking
TC"; "category-ranking TC" is usually tackled by ranking, for a given document
$d_j$, its $CSV_i$ scores for the different categories in $C = \{c_1, \ldots, c_{|C|}\}$.

The $CSV_i$ function takes up different meanings according to the learning method
used: for instance, in the "Naive Bayes" approach of Section 6.2 $CSV_i(d_j)$ is defined
in terms of a probability, whereas in the "Rocchio" approach discussed in Section
6.7 $CSV_i(d_j)$ is a measure of vector closeness in $|T|$-dimensional space.

The construction of a "hard" classifier may follow two alternative paths. The
former consists in the definition of a function $CSV_i : D \rightarrow \{T, F\}$. The latter
consists instead in the definition of a function $CSV_i : D \rightarrow [0, 1]$, analogous to the
one used for ranking classification, followed by the definition of a threshold $\tau_i$ such
that $CSV_i(d_j) \geq \tau_i$ is interpreted as $T$ while $CSV_i(d_j) < \tau_i$ is interpreted as $F$†.
The definition of thresholds will be the topic of Section 6.1. In Sections 6.2 to
6.12 we will instead concentrate on the definition of $CSV_i$, discussing a number of
approaches that have been applied in the TC literature. In general we will assume
we are dealing with "hard" classification; it will be evident from the context how and
whether the approaches can be adapted to ranking classification. The presentation
of the algorithms will be mostly qualitative rather than quantitative, i.e. will focus
on the methods for classifier learning rather than on the effectiveness and efficiency
of the classifiers built by means of them; this will instead be the focus of Section 7.

6.1 Determining thresholds 

There are various policies for determining the threshold $\tau_i$, also depending on the
constraints imposed by the application. The most important distinction is whether
the threshold is derived analytically or experimentally.

The former method is possible only in the presence of a theoretical result that
indicates how to compute the threshold that maximizes the expected value of the
effectiveness function [Lewis 1995a]. This is typical of classifiers that output
probability estimates of the membership of $d_j$ in $c_i$ (see Section 6.2) and whose
effectiveness is computed by decision-theoretic measures such as utility (see Section 7.1.3);
we thus defer the discussion of this policy (which is called probability thresholding
in [Lewis 1995a]) to Section 7.1.3.

† Alternative methods are possible, such as training a classifier for which some standard,
predefined value such as 0 is the threshold. For ease of exposition we will not discuss them.



When such a theoretical result is not available one has to revert to the latter
method, which consists in testing different values for $\tau_i$ on a validation set and
choosing the value which maximizes effectiveness. We call this policy CSV thresholding
[Cohen and Singer 1999; Schapire et al. 1998; Wiener et al. 1995]; it is also
called Scut in [Yang 1999]. Different $\tau_i$'s are typically chosen for the different $c_i$'s.
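
CSV thresholding can be sketched as a simple search over candidate thresholds on a
validation set. The text leaves the effectiveness function unspecified; the sketch below
assumes $F_1$, takes the observed CSV values as candidate thresholds, and uses
hypothetical function names and toy data.

```python
def f1(predicted, actual):
    """F1 of boolean predictions against boolean true labels (assumed effectiveness measure)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def csv_threshold(scores, labels, effectiveness=f1):
    """Return (tau, effectiveness) for the candidate threshold that scores best."""
    best_tau, best_eff = None, -1.0
    for tau in sorted(set(scores)):  # candidates: the CSV values seen on validation data
        predicted = [s >= tau for s in scores]
        eff = effectiveness(predicted, labels)
        if eff > best_eff:
            best_tau, best_eff = tau, eff
    return best_tau, best_eff


# Hypothetical validation data for one category: CSV values and true memberships.
val_scores = [0.9, 0.8, 0.55, 0.4, 0.3, 0.1]
val_labels = [True, True, False, True, False, False]
tau, eff = csv_threshold(val_scores, val_labels)
```

Running the search once per category yields the category-specific thresholds mentioned
above.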



A second, popular experimental policy is proportional thresholding [Iwayama and
Tokunaga 1995; Larkey 1998; Lewis 1992a; Lewis and Ringuette 1994; Wiener et al.
1995], also called Pcut in [Yang 1999]. This policy consists in choosing the value of
$\tau_i$ for which $g_{Va}(c_i)$ is closest to $g_{Tr}(c_i)$, and embodies the principle that the same
percentage of documents of both training and test set should be classified under $c_i$.
For obvious reasons, this policy does not lend itself to document-pivoted TC.
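
A minimal sketch of proportional thresholding follows. It reads the generality of a
category as the fraction of documents classified under it, which is an interpretation
assumed here; the function name and the data are hypothetical.

```python
def proportional_threshold(train_labels, val_scores):
    """Pick the threshold whose validation generality is closest to the training generality."""
    g_tr = sum(train_labels) / len(train_labels)  # fraction of training documents in c_i
    best_tau, best_gap = None, float("inf")
    for tau in sorted(set(val_scores), reverse=True):
        # Fraction of validation documents that would be classified under c_i at this tau.
        g_va = sum(s >= tau for s in val_scores) / len(val_scores)
        gap = abs(g_va - g_tr)
        if gap < best_gap:
            best_tau, best_gap = tau, gap
    return best_tau


# Hypothetical data: 3 of 10 training documents belong to the category.
train_labels = [True, False, False, True, False, False, False, False, True, False]
val_scores = [0.95, 0.9, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
tau_p = proportional_threshold(train_labels, val_scores)
```

The procedure needs the CSV values of many documents for one category at a time, which
is why it fits category-pivoted rather than document-pivoted TC.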

Sometimes, depending on the application, a fixed thresholding policy (aka
"k-per-doc" thresholding [Lewis 1992a] or Rcut [Yang 1999]) is applied, whereby it is
stipulated that a fixed number $k$ of categories, equal for all $d_j$'s, are to be assigned
to each document $d_j$. This is often used, for instance, in applications of TC to
automated document indexing [Field 1975; Lam et al. 1999]. Strictly speaking,
however, this is not a thresholding policy in the sense defined at the beginning of
Section 6, as it might happen that $d'$ is classified under $c_i$, $d''$ is not, and $CSV_i(d') <
CSV_i(d'')$. Quite clearly, this policy is mostly at home with document-pivoted TC.
However, it suffers from a certain coarseness, as the fact that $k$ is equal for all
documents (nor could this be otherwise) allows no fine-tuning.
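
The following sketch of fixed ("k-per-doc") thresholding also reproduces the anomaly
noted above, in which a document with the lower CSV for a category receives it while a
document with the higher CSV does not. Category names and scores are invented for
illustration.

```python
def k_per_doc(csv_scores, k):
    """Assign to a document the k categories with the highest CSV values."""
    ranked = sorted(csv_scores, key=csv_scores.get, reverse=True)
    return set(ranked[:k])


# Hypothetical CSV values of two documents for three categories.
d_prime = {"corn": 0.4, "wheat": 0.3, "trade": 0.1}
d_second = {"corn": 0.6, "wheat": 0.9, "trade": 0.2}

assigned_prime = k_per_doc(d_prime, k=1)    # corn is the top category of d'
assigned_second = k_per_doc(d_second, k=1)  # wheat outranks corn for d''
```

Here d' is filed under "corn" with a CSV of 0.4, while d'' is not, although its CSV for
"corn" is 0.6.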

In his experiments Lewis [1992a] found the proportional policy to be superior
to probability thresholding when microaveraged effectiveness was tested but
slightly inferior with macroaveraging (see Section 7.1.1). Yang [1999] found instead
CSV thresholding to be superior to proportional thresholding (possibly due to her
category-specific optimization on a validation set), and found fixed thresholding to
be consistently inferior to the other two policies. The fact that these latter results
have been obtained across different classifiers no doubt reinforces them.

In general, aside from the considerations above, the choice of the thresholding 
policy may also be influenced by the application; for instance, in applying a text 
classifier to document indexing for Boolean systems a fixed thresholding policy 
might be chosen, while a proportional or CSV thresholding method might be chosen 
for Web page classification under hierarchical catalogues. 

6.2 Probabilistic classifiers 



Probabilistic classifiers (see [Lewis 1998] for a thorough discussion) view $CSV_i(d_j)$
in terms of $P(c_i|\vec{d}_j)$, i.e. the probability that a document represented by a vector
$\vec{d}_j = \langle w_{1j}, \ldots, w_{|T|j} \rangle$ of (binary or weighted) terms belongs to $c_i$, and compute this
probability by an application of Bayes' theorem, given by

$$P(c_i|\vec{d}_j) = \frac{P(c_i)\,P(\vec{d}_j|c_i)}{P(\vec{d}_j)} \qquad (3)$$

In (3) the event space is the space of documents: $P(\vec{d}_j)$ is thus the probability
that a randomly picked document has vector $\vec{d}_j$ as its representation, and $P(c_i)$ the
probability that a randomly picked document belongs to $c_i$.

The estimation of $P(\vec{d}_j|c_i)$ in (3) is problematic, since the number of possible
vectors $\vec{d}_j$ is too high (the same holds for $P(\vec{d}_j)$, but for reasons that will be
clear shortly this will not concern us). In order to alleviate this problem it is
common to make the assumption that any two coordinates of the document vector
are, when viewed as random variables, statistically independent of each other; this
independence assumption is encoded by the equation

$$P(\vec{d}_j|c_i) = \prod_{k=1}^{|T|} P(w_{kj}|c_i) \qquad (4)$$

Probabilistic classifiers that use this assumption are called Naive Bayes classifiers,
and account for most of the probabilistic approaches to TC in the literature (see
e.g. [Joachims 1998; Koller and Sahami 1997; Larkey and Croft 1996; Lewis 1992a;
Lewis and Gale 1994; Li and Jain 1998; Robertson and Harding 1984]). The "naive"
character of the classifier is due to the fact that usually this assumption is, quite
obviously, not verified in practice.

One of the best-known Naive Bayes approaches is the binary independence classifier
[Robertson and Sparck Jones 1976], which results from using binary-valued
vector representations for documents. In this case, if we write $p_{ki}$ as short for
$P(w_{kx} = 1|c_i)$, the $P(w_{kj}|c_i)$ factors of (4) may be written as

$$P(w_{kj}|c_i) = p_{ki}^{w_{kj}} (1 - p_{ki})^{1 - w_{kj}} = \left(\frac{p_{ki}}{1 - p_{ki}}\right)^{w_{kj}} (1 - p_{ki}) \qquad (5)$$

We may further observe that in TC the document space is partitioned into two
categories‡, $c_i$ and its complement $\bar{c}_i$, such that $P(\bar{c}_i|\vec{d}_j) = 1 - P(c_i|\vec{d}_j)$. If we
plug in (4) and (5) into (3) and take logs we obtain

$$\log P(c_i|\vec{d}_j) = \log P(c_i) + \sum_{k=1}^{|T|} w_{kj} \log\frac{p_{ki}}{1 - p_{ki}} + \sum_{k=1}^{|T|} \log(1 - p_{ki}) - \log P(\vec{d}_j) \qquad (6)$$

$$\log(1 - P(c_i|\vec{d}_j)) = \log(1 - P(c_i)) + \sum_{k=1}^{|T|} w_{kj} \log\frac{p_{k\bar{i}}}{1 - p_{k\bar{i}}} + \sum_{k=1}^{|T|} \log(1 - p_{k\bar{i}}) - \log P(\vec{d}_j) \qquad (7)$$

where we write $p_{k\bar{i}}$ as short for $P(w_{kx} = 1|\bar{c}_i)$. We may convert (6) and (7) into a
single equation by subtracting componentwise (7) from (6), thus obtaining

$$\log\frac{P(c_i|\vec{d}_j)}{1 - P(c_i|\vec{d}_j)} = \log\frac{P(c_i)}{1 - P(c_i)} + \sum_{k=1}^{|T|} w_{kj} \log\frac{p_{ki}(1 - p_{k\bar{i}})}{p_{k\bar{i}}(1 - p_{ki})} + \sum_{k=1}^{|T|} \log\frac{1 - p_{ki}}{1 - p_{k\bar{i}}} \qquad (8)$$

‡ Cooper [1995] has pointed out that in this case the full independence assumption of (4) is not
actually made in the Naive Bayes classifier; the assumption needed here is instead the weaker
linked dependence assumption, which may be written as
$\frac{P(\vec{d}_j|c_i)}{P(\vec{d}_j|\bar{c}_i)} = \prod_{k=1}^{|T|} \frac{P(w_{kj}|c_i)}{P(w_{kj}|\bar{c}_i)}$.
Note that $\frac{P(c_i|\vec{d}_j)}{1 - P(c_i|\vec{d}_j)}$ is an increasing monotonic function of $P(c_i|\vec{d}_j)$, and may thus
be used directly as $CSV_i(d_j)$. Note also that $\log\frac{P(c_i)}{1 - P(c_i)}$ and
$\sum_{k=1}^{|T|} \log\frac{1 - p_{ki}}{1 - p_{k\bar{i}}}$ are constant
for all documents, and may thus be disregarded§. Defining a classifier for category
$c_i$ thus basically requires estimating the $2|T|$ parameters $\{p_{1i}, p_{1\bar{i}}, \ldots, p_{|T|i}, p_{|T|\bar{i}}\}$
from the training data, which may be done in the obvious way. Note that in general
the classification of a given document does not require computing a sum of $|T|$ factors,
as the presence of $\sum_{k=1}^{|T|} w_{kj} \log\frac{p_{ki}(1 - p_{k\bar{i}})}{p_{k\bar{i}}(1 - p_{ki})}$ would imply; in fact, all the factors
for which $w_{kj} = 0$ may be disregarded, and this accounts for the vast majority of
them, since document vectors are usually very sparse.
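
As an illustration, a minimal sketch of the binary independence classifier follows. It
represents a document as the set of its active terms, estimates the parameters by
relative frequency with add-one smoothing (the smoothing is an assumption introduced
here to avoid taking the logarithm of zero), and sums the log odds ratios over active
terms only. All names and data are hypothetical.

```python
import math


def estimate(train_docs, labels, vocabulary, smoothing=1.0):
    """Estimate p_ki = P(w_k = 1 | c_i) and its counterpart for the complement of c_i."""
    pos = [d for d, y in zip(train_docs, labels) if y]
    neg = [d for d, y in zip(train_docs, labels) if not y]
    p_pos, p_neg = {}, {}
    for t in vocabulary:
        # Smoothed relative frequency of documents containing term t (assumption).
        p_pos[t] = (sum(t in d for d in pos) + smoothing) / (len(pos) + 2 * smoothing)
        p_neg[t] = (sum(t in d for d in neg) + smoothing) / (len(neg) + 2 * smoothing)
    return p_pos, p_neg


def csv(doc, p_pos, p_neg):
    """Document-dependent part of the log odds: a sum over the active terms only."""
    return sum(
        math.log(p_pos[t] * (1 - p_neg[t]) / (p_neg[t] * (1 - p_pos[t])))
        for t in doc
        if t in p_pos
    )


# Hypothetical training set: documents as sets of terms, True = positive example.
train_docs = [{"wheat", "farm"}, {"wheat", "export"}, {"bank", "rate"}, {"bank", "export"}]
labels = [True, True, False, False]
vocabulary = sorted(set().union(*train_docs))
p_pos, p_neg = estimate(train_docs, labels, vocabulary)
score_wheat = csv({"wheat", "export"}, p_pos, p_neg)
score_bank = csv({"bank"}, p_pos, p_neg)
```

Because only the terms present in the document are visited, the cost of scoring grows
with the document length rather than with $|T|$.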

The method we have illustrated is just one of the many variants of the Naive
Bayes approach, the common denominator of which is (4). A recent paper by
Lewis [1998] is an excellent roadmap on the various directions that research on
Naive Bayes classifiers has taken; among these are the ones aiming

- to relax the constraint that document vectors should be binary-valued. This looks
natural, given that weighted indexing techniques (see e.g. [Fuhr 1989; Salton and
Buckley 1988]) accounting for the "importance" of $t_k$ for $d_j$ play a key role in IR.

- to introduce document length normalization. The value of $\log\frac{P(c_i|\vec{d}_j)}{1 - P(c_i|\vec{d}_j)}$ tends to
be more extreme (i.e. very high or very low) for long documents (i.e. documents
such that $w_{kj} = 1$ for many values of $k$), irrespectively of their semantic
relatedness to $c_i$, thus calling for length normalization. Taking length into account is
easy in non-probabilistic approaches to classification (see e.g. Section 6.7), but
is problematic in probabilistic ones (see [Lewis 1998, Section 5]). One possible
answer is to switch from an interpretation of Naive Bayes in which documents are
events to one in which terms are events [Baker and McCallum 1998; McCallum
et al. 1998; Chakrabarti et al. 1998a; Guthrie et al. 1994]. This accounts for
document length naturally but, as noted in [Lewis 1998], has the drawback that
different occurrences of the same word within the same document are viewed as
independent, an assumption even more implausible than (4).

- to relax the independence assumption. This may be the hardest route to follow,
since this produces classifiers of higher computational cost and characterized by
harder parameter estimation problems [Koller and Sahami 1997]. Earlier efforts
in this direction within probabilistic text search (e.g. [van Rijsbergen 1977]) have
not shown the performance improvements that were hoped for. Recently, the
fact that the binary independence assumption seldom harms effectiveness has
also been given some theoretical justification [Domingos and Pazzani 1997].

The mention of text search in the last paragraph is not accidental. Unlike that on other
types of classifiers, the literature on probabilistic classifiers is inextricably intertwined
with that on probabilistic search systems (see [Crestani et al. 1998] for a review),
since these latter attempt to determine the probability that a document falls in the
category denoted by the query, and since they are the only search systems that take
relevance feedback, a notion essentially involving supervised learning, as central.

§ This is not true, however, if the "fixed thresholding" method of Section 6.1 is adopted. In fact,
for a fixed document $d_j$ the first and third factor in (8) are different for different
categories, and may therefore influence the choice of the categories under which to file $d_j$.

Fig. 2. A decision tree equivalent to the DNF rule of Figure 1. Edges are labelled by terms and
leaves are labelled by categories (underlining denotes negation).

6.3 Decision tree classifiers 

Probabilistic methods are quantitative (i.e. numeric) in nature, and as such have
sometimes been criticized since, effective as they may be, they are not easily interpretable
by humans. A class of algorithms that do not suffer from this problem are symbolic
(i.e. non-numeric) algorithms, among which inductive rule learners (which we will
discuss in Section 6.4) and decision tree learners are the most important examples.

A decision tree (DT) text classifier (see e.g. [Mitchell 1996, Chapter 3]) is a
tree in which internal nodes are labelled by terms, branches departing from them
are labelled by tests on the weight that the term has in the test document, and
leaves are labelled by categories. Such a classifier categorizes a test document $d_j$ by
recursively testing for the weights that the terms labelling the internal nodes have
in vector $\vec{d}_j$, until a leaf node is reached; the label of this node is then assigned to
$d_j$. Most such classifiers use binary document representations, and thus consist of
binary trees. An example DT is illustrated in Figure 2.

There are a number of standard packages for DT learning, and most DT approaches
to TC have made use of one such package. Among the most popular ones
are ID3 (used in [Fuhr et al. 1991]), C4.5 (used in [Cohen and Hirsh 1998; Cohen
and Singer 1999; Joachims 1998; Lewis and Catlett 1994]) and C5 (used in [Li and
Jain 1998]). TC efforts based on experimental DT packages include [Dumais et al.
1998; Lewis and Ringuette 1994; Weiss et al. 1999].

A possible method for learning a DT for category $c_i$ consists in a "divide and
conquer" strategy of (i) checking whether all the training examples have the same
label (either $c_i$ or $\bar{c}_i$); (ii) if not, selecting a term $t_k$, partitioning $Tr$ into classes
of documents that have the same value for $t_k$, and placing each such class in a
separate subtree. The process is recursively repeated on the subtrees until each leaf
of the tree so generated contains training examples assigned to the same category $c_i$,
which is then chosen as the label for the leaf. The key step is the choice of the term
$t_k$ on which to operate the partition, a choice which is generally made according to
an information gain or entropy criterion. However, such a "fully grown" tree may
be prone to overfitting, as some branches may be too specific to the training data.
Most DT learning methods thus include a method for growing the tree and one for
pruning it, i.e. for removing the overly specific branches. Variations on this basic
schema for DT learning abound [Mitchell 1996, Section 3].
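
The growing phase of the "divide and conquer" strategy can be sketched as follows for
binary document representations, using information gain to choose the splitting term.
This is an illustrative sketch only: it omits pruning, it is not the algorithm of any of
the packages cited above, and its names and data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Entropy reduction obtained by splitting on the presence or absence of term."""
    gain = entropy(labels)
    for present in (True, False):
        subset = [y for d, y in zip(docs, labels) if (term in d) == present]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain


def grow_tree(docs, labels, terms):
    majority = Counter(labels).most_common(1)[0][0]
    # (i) All examples share a label (or no term is left to test): make a leaf.
    if len(set(labels)) == 1 or not terms:
        return majority
    # (ii) Otherwise split on the term with the highest information gain.
    best = max(terms, key=lambda t: information_gain(docs, labels, t))
    rest = [t for t in terms if t != best]
    branches = {}
    for present in (True, False):
        part = [(d, y) for d, y in zip(docs, labels) if (best in d) == present]
        if part:
            sub_docs, sub_labels = zip(*part)
            branches[present] = grow_tree(list(sub_docs), list(sub_labels), rest)
        else:
            branches[present] = majority  # empty partition: fall back to the majority label
    return (best, branches)


def classify(tree, doc):
    """Follow the tests from the root until a leaf (a boolean label) is reached."""
    while isinstance(tree, tuple):
        term, branches = tree
        tree = branches[term in doc]
    return tree


# Hypothetical training set: documents as sets of terms, True = positive example.
docs = [{"wheat", "farm"}, {"wheat", "export"}, {"bank", "rate"}, {"bank", "export"}]
labels = [True, True, False, False]
tree = grow_tree(docs, labels, sorted(set().union(*docs)))
```

On this toy data a single test on "bank" separates the two classes, so the fully grown
tree has just two leaves.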



DT text classifiers have been used either as the main classification tool [Fuhr et al.
1991; Lewis and Catlett 1994; Lewis and Ringuette 1994], or as baseline classifiers
[Cohen and Singer 1999; Joachims 1998], or as members of classifier committees [Li
and Jain 1998; Schapire and Singer 2000; Weiss et al. 1999].



6.4 Decision rule classifiers 

A classifier for category $c_i$ built by an inductive rule learning method consists of
a DNF rule, i.e. of a conditional rule with a premise in disjunctive normal form
(DNF), of the type illustrated in Figure 1¶. The literals (i.e. possibly negated
keywords) in the premise denote the presence (non-negated keyword) or absence
(negated keyword) of the keyword in the test document $d_j$, while the clause head
denotes the decision to classify $d_j$ under $c_i$. DNF rules are similar to DTs in that
they can encode any Boolean function. However, an advantage of DNF rule learners
is that they tend to generate more compact classifiers than DT learners.
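
Applying a DNF rule to a document can be sketched as below. The rule shown is a
hypothetical example, not the one in the paper's figure, and the encoding of a literal as
a (term, negated) pair is an assumption made for illustration.

```python
def satisfies(clause, doc):
    """A clause (conjunction) holds if every literal matches the document."""
    # A literal is (term, negated): non-negated requires presence, negated requires absence.
    return all((term in doc) != negated for term, negated in clause)


def dnf_classify(rule, doc):
    """A DNF rule fires if at least one of its clauses holds."""
    return any(satisfies(clause, doc) for clause in rule)


# Hypothetical rule: (wheat AND farm) OR (wheat AND NOT soft) -> category
rule = [
    [("wheat", False), ("farm", False)],
    [("wheat", False), ("soft", True)],
]

first = dnf_classify(rule, {"wheat", "farm", "soft"})   # first clause holds
second = dnf_classify(rule, {"wheat", "soft"})          # neither clause holds
third = dnf_classify(rule, {"wheat"})                   # second clause holds
```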

Rule learning methods usually attempt to select from all the possible covering 
rules (i.e. rules that correctly classify all the training examples) the "best" one 
according to some minimality criterion. While DTs are typically built by a top- 
down, "divide-and-conquer" strategy, DNF rules are often built in a bottom-up 
fashion. Initially, every training example $d_j$ is viewed as a clause $\eta_1, \ldots, \eta_n \rightarrow \gamma_i$,
where $\eta_1, \ldots, \eta_n$ are the terms contained in $d_j$ and $\gamma_i$ equals $c_i$ or $\bar{c}_i$ according to
whether $d_j$ is a positive or negative example of $c_i$. This set of clauses is already a
DNF classifier for $c_i$, but obviously scores high in terms of overfitting. The learner
then applies a process of generalization in which the rule is simplified through a
series of modifications (e.g. removing premises from clauses, or merging clauses) 
that maximize its compactness while at the same time not affecting the "covering" 
property of the classifier. At the end of this process, a "pruning" phase similar in 
spirit to that employed in DTs is applied, where the ability to correctly classify all 
the training examples is traded for more generality. 

DNF rule learners vary widely in terms of the methods, heuristics and criteria
employed for generalization and pruning. Among the DNF rule learners that have
been applied to TC are Charade [Moulinier and Ganascia 1996], DL-ESC [Li and
Yamanishi 1999], Ripper [Cohen 1995a; Cohen and Hirsh 1998; Cohen and Singer
1999], Scar [Moulinier et al. 1996], and Swap-1 [Apte et al. 1994].

¶ Many inductive rule learning algorithms build decision lists (i.e. arbitrarily nested if-then-else
clauses) instead of DNF rules; since the former may always be rewritten as the latter we will
disregard the issue.

While the methods above use rules of propositional logic (PL), research has also
been carried out using rules of first order logic (FOL), obtainable through the use of
inductive logic programming methods. Cohen [1995a] has extensively compared PL
and FOL learning in TC (for instance, comparing the PL learner Ripper with its
FOL version Flipper), and has found that the additional representational power
of FOL brings about only modest benefits.

6.5 Regression methods 



Various TC efforts have used regression models (see e.g. [Fuhr and Pfeifer 1994;
Ittner et al. 1995; Lewis and Gale 1994; Schütze et al. 1995]). Regression denotes the
approximation of a real-valued (rather than binary, as in the case of classification)
function $\breve{\Phi}$ by means of a function $\Phi$ that fits the training data [Mitchell 1996,
page 236]. Here we will describe one such model, the Linear Least Squares Fit
(LLSF) applied to TC by Yang and Chute [1994]. In LLSF, each document $d_j$ has
two vectors associated to it: an input vector $I(d_j)$ of $|T|$ weighted terms, and an
output vector $O(d_j)$ of $|C|$ weights representing the categories (the weights for this
latter vector are binary for training documents, and are non-binary CSVs for test
documents). Classification may thus be seen as the task of determining an output
vector $O(d_j)$ for test document $d_j$, given its input vector $I(d_j)$; hence, building a
classifier boils down to computing a $|C| \times |T|$ matrix $\hat{M}$ such that $\hat{M} I(d_j) = O(d_j)$.

LLSF computes the matrix from the training data by computing a linear least-squares
fit that minimizes the error on the training set according to the formula
$\hat{M} = \arg\min_M \|MI - O\|_F$, where $\arg\min_M(x)$ stands as usual for the $M$ for which
$x$ is minimum, $\|V\|_F \stackrel{def}{=} \sqrt{\sum_{i=1}^{|C|} \sum_{j=1}^{|T|} v_{ij}^2}$ represents the so-called Frobenius norm
of a $|C| \times |T|$ matrix, $I$ is the $|T| \times |Tr|$ matrix whose columns are the input vectors
of the training documents, and $O$ is the $|C| \times |Tr|$ matrix whose columns are the
output vectors of the training documents. The $\hat{M}$ matrix is usually computed by
performing a singular value decomposition on the training set, and its generic entry
$\hat{m}_{ik}$ represents the degree of association between category $c_i$ and term $t_k$.
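
One standard way to obtain such a least-squares solution through a singular value
decomposition uses the Moore-Penrose pseudoinverse of $I$. The sketch below states
that general linear algebra result; whether Yang and Chute [1994] compute $\hat{M}$ in
exactly this way is an assumption.

```latex
% SVD of the |T| x |Tr| input matrix and its pseudoinverse, where \Sigma^{+}
% inverts the non-zero singular values and transposes the result:
I = U \Sigma V^{\top}, \qquad I^{+} = V \Sigma^{+} U^{\top}

% A minimizer of \|MI - O\|_F (the one of minimum Frobenius norm):
\hat{M} = O \, I^{+}
```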

The experiments of [Yang and Chute 1994; Yang and Liu 1999] indicate that
LLSF is one of the most effective text classifiers known to date. One of its disad- 
vantages, though, is that the cost of computing the M matrix is much higher than 
that of many other competitors in the TC arena. 

6.6 On-line methods 

A linear classifier for category $c_i$ is a vector $\vec{c}_i = \langle w_{1i}, \ldots, w_{|T|i} \rangle$ belonging to
the same $|T|$-dimensional space in which documents are also represented, and such
that $CSV_i(d_j)$ corresponds to the dot product $\sum_{k=1}^{|T|} w_{ki} w_{kj}$ of $\vec{d}_j$ and $\vec{c}_i$. Note that
when both classifier and document weights are cosine-normalized (see (2)), the dot
product between the two vectors corresponds to their cosine similarity, i.e.

$$S(c_i, d_j) = \cos(\alpha) = \frac{\sum_{k=1}^{|T|} w_{ki} w_{kj}}{\sqrt{\sum_{k=1}^{|T|} w_{ki}^2} \, \sqrt{\sum_{k=1}^{|T|} w_{kj}^2}}$$

which represents the cosine of the angle $\alpha$ that separates the two vectors. This is
the similarity measure between query and document computed by standard vector-space
IR engines, which means in turn that once a linear classifier has been built,
classification can be performed by invoking such an engine. Practically all search
engines have a dot product flavour to them, and can therefore be adapted to doing
TC with a linear classifier.
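
The equivalence between the dot product of cosine-normalized vectors and the cosine
similarity of the original vectors can be checked with a short sketch. Sparse vectors are
represented as dictionaries from terms to weights; the weights are invented for
illustration.

```python
import math


def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dictionaries."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def cosine_normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return {t: w / norm for t, w in v.items()} if norm else dict(v)


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical classifier profile and document vector.
classifier = {"wheat": 3.0, "farm": 4.0}
document = {"wheat": 1.0, "bank": 1.0}

csv_value = dot(cosine_normalize(classifier), cosine_normalize(document))
```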

Methods for learning linear classifiers are often partitioned in two broad classes, 
batch methods and on-line methods. 

Batch methods build a classifier by analysing the training set all at once. Within
the TC literature, one example of a batch method is linear discriminant analysis,
a model of the stochastic dependence between terms that relies on the covariance
matrices of the categories [Hull 1994; Schütze et al. 1995]. However, the foremost
example of a batch method is the Rocchio method; because of its importance in
the TC literature this will be discussed separately in Section 6.7. In this section we
will instead concentrate on on-line classifiers.

On-line (aka incremental) methods build a classifier soon after examining the 
first training document, and incrementally refine it as they examine new ones. This 
may be an advantage in the applications in which Tr is not available in its entirety 
from the start, or in which the "meaning" of the category may change in time, 
as e.g. in adaptive filtering. On-line methods are also well suited to applications
(e.g. semi-automated classification, adaptive filtering) in which we may expect the
user of a classifier to provide feedback on how test documents have been classified,
as in this case further training may be performed during the operating phase by
exploiting user feedback.

A simple on-line method is the perceptron algorithm, first applied to TC in
[Schütze et al. 1995; Wiener et al. 1995] and subsequently used in [Dagan et al.
1997; Ng et al. 1997]. In this algorithm, the classifier for $c_i$ is first initialized by
setting all weights $w_{ki}$ to the same positive value. When a training example $d_j$
(represented by a vector $\vec{d}_j$ of binary weights) is examined, the classifier built so
far classifies it. If the result of the classification is correct nothing is done, while if
it is wrong the weights of the classifier are modified: if $d_j$ was a positive example of
$c_i$ then the weights $w_{ki}$ of "active terms" (i.e. the terms $t_k$ such that $w_{kj} = 1$) are
"promoted" by increasing them by a fixed quantity $\alpha > 0$ (called learning rate),
while if $d_j$ was a negative example of $c_i$ then the same weights are "demoted" by
decreasing them by $\alpha$. Note that when the classifier has reached a reasonable level
of effectiveness, the fact that a weight $w_{ki}$ is very low means that $t_k$ has negatively
contributed to the classification process so far, and may thus be discarded from the
representation. We may then see the perceptron algorithm (as all other incremental
learning methods) as allowing for a sort of "on-the-fly term space reduction" [Dagan
et al. 1997, Section 4.4]. The perceptron classifier has shown a good effectiveness
in all the experiments quoted above.
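
The update rule just described can be sketched as follows. The text does not fix the
decision rule or the number of passes, so the threshold `tau`, the initial weight, the
learning rate value and the repeated passes over the training set are assumptions of this
sketch, as are all names and data.

```python
def predict(weights, doc, tau):
    """Classify doc (a set of active terms) as positive if the weight sum reaches tau."""
    return sum(weights.get(t, 0.0) for t in doc) >= tau


def train_perceptron(docs, labels, vocabulary, alpha=0.25, init=1.0, tau=1.5, max_passes=10):
    weights = {t: init for t in vocabulary}  # all weights start at the same positive value
    for _ in range(max_passes):
        mistakes = 0
        for doc, positive in zip(docs, labels):
            if predict(weights, doc, tau) == positive:
                continue  # correct classification: nothing is done
            mistakes += 1
            step = alpha if positive else -alpha  # promote or demote the active terms
            for t in doc:
                if t in weights:
                    weights[t] += step
        if mistakes == 0:
            break
    return weights


# Hypothetical training set: documents as sets of terms, True = positive example.
docs = [{"wheat", "farm"}, {"wheat", "export"}, {"bank", "rate"}, {"bank", "export"}]
labels = [True, True, False, False]
weights = train_perceptron(docs, labels, sorted(set().union(*docs)))
```

After training, "bank" carries the lowest weight, which illustrates how low-weight terms
become candidates for removal from the representation.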

The perceptron is an additive weight-updating algorithm. A multiplicative variant
of it is Positive Winnow [Dagan et al. 1997], which differs from perceptron
because two different constants α_1 > 1 and 0 < α_2 < 1 are used for promoting and
demoting weights, respectively, and because promotion and demotion are achieved
by multiplying, instead of adding, by α_1 and α_2. Balanced Winnow [Dagan
et al. 1997] is a further variant of Positive Winnow, in which the classifier consists
of two weights w^+_ki and w^-_ki for each term t_k; the final weight w_ki used in
computing the dot product is the difference w^+_ki - w^-_ki. Following the misclassification
of a positive instance, active terms have their w^+_ki weight promoted and their
w^-_ki weight demoted, whereas in the case of a negative instance it is w^+_ki that gets
demoted while w^-_ki gets promoted (for the rest, promotions and demotions are as in
Positive Winnow). Balanced Winnow allows negative w_ki weights, while in
the perceptron and in Positive Winnow the w_ki weights are always positive. In
experiments conducted by Dagan et al. [1997], Positive Winnow showed a better
effectiveness than perceptron but was in turn outperformed by (Dagan et al.'s own
version of) Balanced Winnow.
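
The two multiplicative variants can be sketched side by side as follows. This is only an illustration under assumed settings: the initial weights, the threshold and the function names are hypothetical choices, not those of Dagan et al.

```python
def train_positive_winnow(examples, n_terms, alpha1=2.0, alpha2=0.5,
                          threshold=1.0, w_init=1.0):
    """One weight per term; mistakes multiply active weights by alpha1 or alpha2."""
    w = [w_init] * n_terms
    for active_terms, is_positive in examples:
        score = sum(w[k] for k in active_terms)
        if (score > threshold) == is_positive:
            continue  # no update on correct decisions
        factor = alpha1 if is_positive else alpha2
        for k in active_terms:
            w[k] *= factor
    return w


def train_balanced_winnow(examples, n_terms, alpha1=2.0, alpha2=0.5,
                          threshold=1.0, w_plus_init=2.0, w_minus_init=1.0):
    """Two weights per term; the effective weight is w_plus - w_minus."""
    w_plus = [w_plus_init] * n_terms
    w_minus = [w_minus_init] * n_terms
    for active_terms, is_positive in examples:
        score = sum(w_plus[k] - w_minus[k] for k in active_terms)
        if (score > threshold) == is_positive:
            continue
        for k in active_terms:
            if is_positive:   # promote w_plus, demote w_minus
                w_plus[k] *= alpha1
                w_minus[k] *= alpha2
            else:             # demote w_plus, promote w_minus
                w_plus[k] *= alpha2
                w_minus[k] *= alpha1
    return [p - m for p, m in zip(w_plus, w_minus)]


examples = [({0}, True), ({1, 2}, False)]
print(train_positive_winnow(examples, n_terms=3))  # [2.0, 0.5, 0.5]
print(train_balanced_winnow(examples, n_terms=3))  # [3.5, -1.0, -1.0]
```

On the same two mistakes, Positive Winnow keeps every weight positive, while the effective weights of Balanced Winnow can become negative.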

Other on-line methods for building text classifiers are Widrow-Hoff, a refinement
of it called Exponentiated Gradient (both applied for the first time to TC
in [Lewis et al. 1996]) and Sleeping Experts [Cohen and Singer 1999], a version
of Balanced Winnow. While the first is an additive weight-updating algorithm,
the second and third are multiplicative. Key differences with the previously
described algorithms are that these three algorithms (i) update the classifier not only
after misclassifying a training example, but also after classifying it correctly, and
(ii) update the weights corresponding to all terms (instead of just active ones).

Linear classifiers lend themselves to both category-pivoted and document-pivoted
TC. For the former the classifier c_i is used, in a standard search engine, as a query
against the set of test documents, while for the latter the vector d_j representing
the test document is used as a query against the set of classifiers {c_1, ..., c_|C|}.

6.7 The Rocchio method 

Some linear classifiers consist of an explicit profile (or prototypical document) of the
category. This has obvious advantages in terms of interpretability, as such a profile
is more readily understandable by a human than, say, a neural network classifier.
Learning a linear classifier is often preceded by local TSR; in this case, a profile
of c_i is a weighted list of the terms whose presence or absence is most useful for
discriminating c_i.

The Rocchio method is used for inducing linear, profile-style classifiers. It relies
on an adaptation to TC of the well-known Rocchio's formula for relevance feedback
in the vector-space model, and it is perhaps the only TC method rooted in the IR
tradition rather than in the ML one. This adaptation was first proposed by Hull
[1994], and has been used by many authors since then, either as an object of research
[Schapire et al. 1998; Singhal et al. 1997], or as a baseline classifier [Cohen and
Singer 1999; Galavotti et al. 2000; Joachims 1998; Lewis et al. 1996; Schapire and
Singer 2000; Schütze et al. 1995], or as a member of a classifier committee [Larkey
and Croft 1996] (see Section 6.11).



Rocchio's method computes a classifier c_i = (w_1i, ..., w_|T|i) for category c_i by
means of the formula

$$ w_{ki} = \beta \cdot \sum_{\{d_j \in POS_i\}} \frac{w_{kj}}{|POS_i|} \; - \; \gamma \cdot \sum_{\{d_j \in NEG_i\}} \frac{w_{kj}}{|NEG_i|} $$

where w_kj is the weight of t_k in document d_j, $POS_i = \{d_j \in Tr \mid \breve{\Phi}(d_j, c_i) = T\}$ and
$NEG_i = \{d_j \in Tr \mid \breve{\Phi}(d_j, c_i) = F\}$. In this formula, β and γ are control parameters
that allow setting the relative importance of positive and negative examples. For
instance, if β is set to 1 and γ to 0 (as e.g. in [Dumais et al. 1998; Hull 1994;
Joachims 1998; Schütze et al. 1995]), the profile of c_i is the centroid of its positive
training examples. A classifier built by means of the Rocchio method rewards
the closeness of a test document to the centroid of the positive training examples,
and its distance from the centroid of the negative training examples. The role of
negative examples is usually de-emphasized, by setting β to a high value and γ to
a low one (e.g. Cohen and Singer [1999], Ittner et al. [1995], and Joachims [1997]
use β = 16 and γ = 4).

Fig. 3. A comparison between the TC behaviour of (a) the Rocchio classifier, and (b) the k-NN
classifier. Small crosses and circles denote positive and negative training instances, respectively.
The big circles denote the "influence area" of the classifier. Note that, for ease of illustration,
document similarities are here viewed in terms of Euclidean distance rather than, as more common,
in terms of dot product or cosine.

This method is quite easy to implement, and is also quite efficient, since learning
a classifier basically comes down to averaging weights. In terms of effectiveness,
instead, a drawback is that if the documents in the category tend to occur in
disjoint clusters (e.g. a set of newspaper articles labelled with the Sports category
and dealing with either boxing or rock-climbing), such a classifier may miss most of
them, as the centroid of these documents may fall outside all of these clusters (see
Figure 3a). More generally, a classifier built by the Rocchio method, as all linear
classifiers, has the disadvantage that it divides the space of documents linearly.
This situation is graphically depicted in Figure 3a, where documents are classified
within c_i if and only if they fall within the circle. Note that even most of the
positive training examples would not be classified correctly by the classifier.
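
Since learning a Rocchio profile comes down to averaging weights, it can be sketched in a few lines of Python. The sketch assumes dense term-weight vectors of equal length; the function name rocchio_profile and the toy data are illustrative only.

```python
def rocchio_profile(docs, labels, beta=16.0, gamma=4.0):
    """Return the profile (w_1i, ..., w_|T|i) of a category.

    docs:   list of term-weight vectors (w_1j, ..., w_|T|j)
    labels: list of booleans, True if d_j is a positive example of the category
    """
    n_terms = len(docs[0])

    def centroid(group):
        if not group:  # assumption: an empty set contributes a zero vector
            return [0.0] * n_terms
        return [sum(d[k] for d in group) / len(group) for k in range(n_terms)]

    pos = centroid([d for d, y in zip(docs, labels) if y])
    neg = centroid([d for d, y in zip(docs, labels) if not y])
    return [beta * p - gamma * n for p, n in zip(pos, neg)]


docs = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = [True, True, False]
print(rocchio_profile(docs, labels, beta=1.0, gamma=0.0))  # [2.0, 0.0]
print(rocchio_profile(docs, labels))                       # [32.0, -8.0]
```

With β = 1 and γ = 0 the profile is simply the centroid of the positive examples; with β = 16 and γ = 4 terms typical of the negative examples receive negative weights.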

6.7.1 Enhancements to the basic Rocchio framework. One issue in the application
of the Rocchio formula to profile extraction is whether the set NEG_i should
be considered in its entirety, or whether a well-chosen sample of it, such as the
set NPOS_i of near-positives (defined as "the most positive amongst the negative
training examples"), should be selected from it, yielding

$$ w_{ki} = \beta \cdot \sum_{\{d_j \in POS_i\}} \frac{w_{kj}}{|POS_i|} \; - \; \gamma \cdot \sum_{\{d_j \in NPOS_i\}} \frac{w_{kj}}{|NPOS_i|} $$

The $\sum_{\{d_j \in NPOS_i\}} \frac{w_{kj}}{|NPOS_i|}$ factor is more significant than $\sum_{\{d_j \in NEG_i\}} \frac{w_{kj}}{|NEG_i|}$, since
near-positives are the most difficult documents to tell apart from the positives. Using
near-positives corresponds to the query zoning method proposed for IR by
Singhal et al. [1997]. This method originates from the observation that when the
original Rocchio formula is used for relevance feedback in IR, near-positives tend to
be used rather than generic negatives, as the documents on which user judgments
are available are among the ones that had scored highest in the previous ranking.
Early applications of the Rocchio formula to TC (e.g. [Hull 1994; Ittner et al. 1995])
generally did not make a distinction between near-positives and generic negatives.
In order to select the near-positives Schapire et al. [1998] issue a query, consisting of
the centroid of the positive training examples, against a document base consisting
of the negative training examples; the top-ranked ones are the most similar to this
centroid, and are then the near-positives. Wiener et al. [1995] instead equate the
near-positives of c_i to the positive examples of the sibling categories of c_i, as in
the application they work on (TC with hierarchical category sets) the notion of a
"sibling category of c_i" is well-defined. A similar policy is also adopted in [Ng et al.
1997; Ruiz and Srinivasan 1999; Weigend et al. 1999].
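
The centroid-as-query selection of near-positives can be sketched as follows. The use of the dot product as the ranking function and the cut-off parameter n_near are assumptions made for illustration; the source does not fix either.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def centroid(vectors):
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def near_positives(docs, labels, n_near):
    """Indices of the n_near negative examples most similar to the positive centroid."""
    query = centroid([d for d, y in zip(docs, labels) if y])
    negative_ids = [i for i, y in enumerate(labels) if not y]
    ranked = sorted(negative_ids, key=lambda i: dot(query, docs[i]), reverse=True)
    return ranked[:n_near]


docs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.9, 0.1]]
labels = [True, False, False, False]
print(near_positives(docs, labels, n_near=2))  # [1, 3]
```

The returned documents would then play the role of NPOS_i in the modified formula.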



By using query zoning plus other enhancements (TSR, statistical phrases, and
a method called dynamic feedback optimization), Schapire et al. [1998] have found
that a Rocchio classifier can achieve an effectiveness comparable to that of a
state-of-the-art ML method such as "boosting" (see Section 6.11.1) while being 60 times
quicker to train. These recent results will no doubt bring about a renewed interest
for the Rocchio classifier, previously considered an underperformer [Cohen and
Singer 1999; Joachims 1998; Lewis et al. 1996; Schütze et al. 1995; Yang 1999].



6.8 Neural networks 

A neural network (NN) text classifier is a network of units, where the input units 
represent terms, the output unit(s) represent the category or categories of interest, 
and the weights on the edges connecting units represent dependence relations. For 
classifying a test document dj, its term weights Wkj are loaded into the input units; 
the activation of these units is propagated forward through the network, and the 
value of the output unit(s) determines the categorization decision(s). A typical 
way of training NNs is backpropagation, whereby the term weights of a training 
document are loaded into the input units, and if a misclassification occurs the error 
is "backpropagated" so as to change the parameters of the network and eliminate 
or minimize the error. 

The simplest type of NN classifier is the perceptron [Dagan et al. 1997; Ng et al.
1997], which is a linear classifier and as such has been extensively discussed in
Section 6.6. Other types of linear NN classifiers implementing a form of logistic
regression have also been proposed and tested by Schütze et al. [1995] and Wiener
et al. [1995].

A non-linear NN [Lam and Lee 1999; Ruiz and Srinivasan 1999; Schütze et al.
1995; Weigend et al. 1999; Wiener et al. 1995; Yang and Liu 1999] is instead a
network with one or more additional "layers" of units, which in TC usually represent
higher-order interactions between terms that the network is able to learn. When
comparative experiments relating non-linear NNs to their linear counterparts have
been performed, the former have yielded either no improvement [Schütze et al.
1995] or very small improvements [Wiener et al. 1995] over the latter.



6.9 Example-based classifiers 

Example-based classifiers do not build an explicit, declarative representation of the
category c_i, but rely on the category labels attached to the training documents
similar to the test document. These methods have thus been called lazy learners,
since "they defer the decision on how to generalize beyond the training data until
each new query instance is encountered" [Mitchell 1996, page 244].

The first application of example-based methods (aka memory-based reasoning
methods) to TC is due to Creecy, Masand and colleagues [Creecy et al. 1992;
Masand et al. 1992]; examples include [Joachims 1998; Lam et al. 1999; Larkey
1998; Larkey 1999; Li and Jain 1998; Yang and Pedersen 1997; Yang and Liu 1999].
Our presentation of the example-based approach will be based on the k-NN (for "k
nearest neighbours") algorithm used by Yang [1994]. For deciding whether d_j ∈ c_i,
k-NN looks at whether the k training documents most similar to d_j also are in
c_i; if the answer is positive for a large enough proportion of them, a positive
decision is taken, and a negative decision is taken otherwise. Actually, Yang's is a
distance-weighted version of k-NN (see e.g. [Mitchell 1996, Section 8.2.1]), since the
fact that a most similar document is in c_i is weighted by its similarity with the test
document. Classifying d_j by means of k-NN thus comes down to computing

$$ CSV_i(d_j) = \sum_{d_z \in Tr_k(d_j)} RSV(d_j, d_z) \cdot [\![ \breve{\Phi}(d_z, c_i) ]\!] \qquad (9) $$

where Tr_k(d_j) is the set of the k documents d_z which maximize RSV(d_j, d_z) and

$$ [\![ \alpha ]\!] = \begin{cases} 1 & \text{if } \alpha = T \\ 0 & \text{if } \alpha = F \end{cases} $$

The thresholding methods of Section 6.1 can then be used to convert the real-valued
CSV_i's into binary categorization decisions. In (9), RSV(d_j, d_z) represents some
measure of semantic relatedness between a test document d_j and a training
document d_z; any matching function, be it probabilistic (as used in [Larkey and Croft
1996]) or vector-based (as used in [Yang 1994]), from a ranked IR system may be
used for this purpose. The construction of a k-NN classifier also involves determining
(experimentally, on a validation set) a threshold k that indicates how many
top-ranked training documents have to be considered for computing CSV_i(d_j). Larkey
and Croft [1996] use k = 20, while Yang [1994, 1999] has found 30 ≤ k ≤ 45 to yield
the best effectiveness. Anyhow, various experiments have shown that increasing the
value of k does not significantly degrade the performance.
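
A minimal sketch of the distance-weighted k-NN score follows. Cosine similarity is used as the matching function purely as an assumption (any RSV function could be plugged in), and the names knn_csv and cosine are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two term-weight vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_csv(test_doc, train_docs, train_labels, k, rsv=cosine):
    """Sum the similarities of those among the k most similar training
    documents that belong to the category."""
    scored = sorted(
        ((rsv(test_doc, d), y) for d, y in zip(train_docs, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return sum(score for score, in_category in scored[:k] if in_category)


train_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_labels = [True, False, True]
print(knn_csv([1.0, 0.0], train_docs, train_labels, k=2))
```

The resulting score still has to be thresholded to obtain a binary decision, and every call ranks the whole training set.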

Note that k-NN, unlike linear classifiers, does not divide the document space
linearly, hence does not suffer from the problem discussed at the end of Section
6.7. This is graphically depicted in Figure 3b, where the more "local" character of
k-NN with respect to Rocchio can be appreciated.

This method is naturally geared towards document-pivoted TC, since ranking the
training documents for their similarity with the test document can be done once
for all categories. For category-pivoted TC one would need to store the document
ranks for each test document, which is obviously clumsy; DPC is thus de facto the
only reasonable way to use k-NN.






A number of different experiments (see Section 7.3) have shown k-NN to be quite
effective. However, its most important drawback is its inefficiency at classification
time: while e.g. with a linear classifier only a dot product needs to be computed
to classify a test document, k-NN requires the entire training set to be ranked
for similarity with the test document, which is much more expensive. This is a
drawback of "lazy" learning methods, since they do not have a true training phase
and thus defer all the computation to classification time.

6.9.1 Other example-based techniques. Various example-based techniques have
been used in the TC literature. For example, Cohen and Hirsh [1998] implement
an example-based classifier by extending standard relational DBMS technology with
"similarity-based soft joins". In their Whirl system they use the scoring function

$$ CSV_i(d_j) = 1 - \prod_{d_z \in Tr_k(d_j)} \left( 1 - RSV(d_j, d_z) \right)^{[\![ \breve{\Phi}(d_z, c_i) ]\!]} $$

as an alternative to (9), obtaining a small but statistically significant improvement
over a version of Whirl using (9). In their experiments this technique outperformed
a number of other classifiers, such as a C4.5 decision tree classifier and the Ripper
CNF rule-based classifier.



A variant of the basic k-NN approach is proposed by Galavotti et al. [2000], who
reinterpret (9) by redefining $[\![ \alpha ]\!]$ as

$$ [\![ \alpha ]\!] = \begin{cases} 1 & \text{if } \alpha = T \\ -1 & \text{if } \alpha = F \end{cases} $$

The difference from the original k-NN approach is that if a training document
d_z similar to the test document d_j does not belong to c_i, this information is not
discarded but weights negatively in the decision to classify d_j under c_i.



A combination of profile- and example-based methods is presented in [Lam and
Ho 1998]. In this work a k-NN system is fed generalized instances (GIs) in place of
training documents. This approach may be seen as the result of

- clustering the training set, thus obtaining a set of clusters K_i = {k_i1, ..., k_i|K_i|};
- building a profile G(k_iz) ("generalized instance") from the documents belonging
  to cluster k_iz by means of some algorithm for learning linear classifiers (e.g.
  Rocchio, Widrow-Hoff);
- applying k-NN with profiles in place of training documents, i.e. computing

$$ CSV_i(d_j) = \sum_{k_{iz} \in K_i} RSV(d_j, G(k_{iz})) \cdot \frac{|\{d_j \in k_{iz} \mid \breve{\Phi}(d_j, c_i) = T\}|}{|\{d_j \in k_{iz}\}|} \cdot \frac{|\{d_j \in k_{iz}\}|}{|Tr|} $$

where $\frac{|\{d_j \in k_{iz} \mid \breve{\Phi}(d_j, c_i) = T\}|}{|\{d_j \in k_{iz}\}|}$ represents the "degree" to which G(k_iz) is a positive
instance of c_i, and $\frac{|\{d_j \in k_{iz}\}|}{|Tr|}$ represents its weight within the entire process.

This exploits the superior effectiveness (see Figure 3) of k-NN over linear classifiers
while at the same time avoiding the sensitivity of k-NN to the presence of "outliers"
(i.e. positive instances of c_i that "lie out" of the region where most other positive
instances of c_i are located) in the training set.

Fig. 4. Learning support vector classifiers. The small crosses and circles represent positive and
negative training examples, respectively, whereas lines represent decision surfaces. Decision surface
σ_i (indicated by the thicker line) is, among those shown, the best possible one, as it is the middle
element of the widest set of parallel decision surfaces (i.e. its minimum distance to any training
example is maximum). Small boxes indicate the support vectors.

6.10 Building classifiers by support vector machines 

The support vector machine (SVM) method has been introduced in TC by Joachims
[1998, 1999] and subsequently used in [Drucker et al. 1999; Dumais et al. 1998;
Dumais and Chen 2000; Klinkenberg and Joachims 2000; Taira and Haruno 1999].
The method seeks, among all the surfaces σ_1, σ_2, ... in |T|-dimensional space that
separate the positive from the negative training examples (decision surfaces), the σ_i
that separates the positives from the negatives by the widest possible margin, i.e. such
that the separation property is invariant with respect to the widest possible translation of σ_i.

This idea is best understood in the case in which the positives and the negatives
are linearly separable, in which case the decision surfaces are (|T| - 1)-hyperplanes.
In the 2-dimensional case of Figure 4, various lines may be chosen as decision
surfaces. The SVM method chooses the middle element from the "widest" set of
parallel lines, i.e. from the set in which the maximum distance between two elements
in the set is highest. It is noteworthy that this "best" decision surface is determined
by only a small set of training examples, called the support vectors.
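
The notion of "widest margin" can be made concrete with a toy Python sketch. It does not train an SVM: it only measures the margin of a few hand-picked candidate hyperplanes (pairs of a weight vector w and an offset b, both hypothetical) and keeps the widest one, which is an assumption-laden illustration of the selection criterion rather than of any actual SVM training algorithm.

```python
import math


def margin(w, b, points, labels):
    """Smallest signed distance of any training example from the hyperplane
    w.x + b = 0; positive only if the hyperplane separates the two classes."""
    norm = math.sqrt(sum(x * x for x in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


def widest_margin(candidates, points, labels):
    """Pick the candidate (w, b) whose margin is largest."""
    return max(candidates, key=lambda wb: margin(wb[0], wb[1], points, labels))


points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
labels = [-1, -1, 1, 1]  # negatives on the left, positives on the right
candidates = [((1.0, 0.0), -1.0), ((1.0, 0.0), -1.5), ((1.0, 0.0), -2.5)]

best = widest_margin(candidates, points, labels)
print(best, margin(best[0], best[1], points, labels))  # ((1.0, 0.0), -1.5) 1.5
```

Among these three parallel lines the middle one wins, since its minimum distance to any training example is the largest.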

The method described is applicable also to the case in which the positives and the
negatives are not linearly separable. Yang and Liu [1999] experimentally compared
the linear case (namely, when the assumption is made that the categories are linearly
separable) with the non-linear case on a standard benchmark, and obtained slightly
better results in the former case.



As argued by Joachims [1998], SVMs offer two important advantages for TC:

- term selection is often not needed, as SVMs tend to be fairly robust to overfitting
  and can scale up to considerable dimensionalities;
- no human and machine effort in parameter tuning on a validation set is needed, as
  there is a theoretically motivated, "default" choice of parameter settings, which
  has also been shown to provide the best effectiveness.

Dumais et al. [1998] have applied a novel algorithm for training SVMs which brings
about training speeds comparable to computationally easy learners such as Rocchio.

6.11 Classifier committees

Classifier committees (aka ensembles) are based on the idea that, given a task that
requires expert knowledge to perform, k experts may be better than one if their
individual judgments are appropriately combined. In TC, the idea is to apply k
different classifiers Φ_1, ..., Φ_k to the same task of deciding whether d_j ∈ c_i, and then
combine their outcome appropriately. A classifier committee is then characterized
by (i) a choice of k classifiers, and (ii) a choice of a combination function.

Concerning issue (i), it is known from the ML literature that, in order to guarantee
good effectiveness, the classifiers forming the committee should be as independent
as possible [Tumer and Ghosh 1996]. The classifiers may differ for the indexing
approach used, or for the inductive method, or both. Within TC, the avenue which
has been explored most is the latter (to our knowledge the only example of the
former is [Scott and Matwin 1999]).



Concerning issue (ii), various rules have been tested. The simplest one is majority
voting (MV), whereby the binary outputs of the k classifiers are pooled together,
and the classification decision that reaches the majority of votes is taken (k
obviously needs to be an odd number) [Li and Jain 1998; Liere and Tadepalli 1997].
This method is particularly suited to the case in which the committee includes
classifiers characterized by a binary decision function CSV_i : D → {T, F}. A
second rule is weighted linear combination (WLC), whereby a weighted sum of the
CSV_i's produced by the k classifiers yields the final CSV_i. The weights w_j reflect
the expected relative effectiveness of classifiers Φ_j, and are usually optimized on
a validation set [Larkey and Croft 1996]. Another policy is dynamic classifier
selection (DCS), whereby among committee {Φ_1, ..., Φ_k} the classifier Φ_t most
effective on the l validation examples most similar to d_j is selected, and its judgment
adopted by the committee [Li and Jain 1998]. A still different policy, somehow
intermediate between WLC and DCS, is adaptive classifier combination (ACC),
whereby the judgments of all the classifiers in the committee are summed together,
but their individual contribution is weighted by their effectiveness on the l validation
examples most similar to d_j [Li and Jain 1998].
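
The four combination rules can be sketched as small Python functions. Several details are assumptions made for illustration only: classifiers are plain callables returning booleans, "effectiveness" on the neighbouring validation examples is taken to be accuracy, and ACC encodes the two judgments as +1 and -1.

```python
def majority_vote(decisions):
    """MV: decisions is a list of k booleans, with k odd."""
    return sum(decisions) > len(decisions) / 2


def weighted_linear_combination(csvs, weights):
    """WLC: weighted sum of the k real-valued CSV scores."""
    return sum(w * s for w, s in zip(weights, csvs))


def local_accuracy(classifier, neighbours):
    """Accuracy on the l validation examples most similar to the test document;
    neighbours is a list of (document, true_label) pairs."""
    return sum(classifier(d) == y for d, y in neighbours) / len(neighbours)


def dynamic_classifier_selection(classifiers, neighbours, doc):
    """DCS: adopt the judgment of the locally most effective classifier."""
    best = max(classifiers, key=lambda c: local_accuracy(c, neighbours))
    return best(doc)


def adaptive_classifier_combination(classifiers, neighbours, doc):
    """ACC: sum all judgments, each weighted by local effectiveness."""
    total = sum(
        local_accuracy(c, neighbours) * (1 if c(doc) else -1) for c in classifiers
    )
    return total > 0


# Toy committee over strings: three hypothetical classifiers.
committee = [lambda d: True, lambda d: False, lambda d: len(d) > 3]
neighbours = [("abcd", True), ("ab", False), ("abcde", True)]

print(majority_vote([c("abcdef") for c in committee]))                   # True
print(dynamic_classifier_selection(committee, neighbours, "xy"))         # False
print(adaptive_classifier_combination(committee, neighbours, "abcdef"))  # True
```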



Classifier committees have had mixed results in TC so far. Larkey and Croft
[1996] have used combinations of Rocchio, Naive Bayes and k-NN, all together or
in pairwise combinations, using a WLC rule. In their experiments the combination
of any two classifiers outperformed the best individual classifier (k-NN), and the
combination of the three classifiers improved on all three pairwise combinations.
These results would seem to give strong support to the idea that classifier
committees can somehow profit from the complementary strengths of their individual
members. However, the small size of the test set used (187 documents) suggests
that more experimentation is needed before conclusions can be drawn.

Li and Jain [1998] have tested a committee formed of (various combinations of)
a Naive Bayes classifier, an example-based classifier, a decision tree classifier, and
a classifier built by means of their own "subspace method"; the combination rules
they have worked with are MV, DCS and ACC. Only in the case of a committee
formed by Naive Bayes and the subspace classifier combined by means of ACC
has the committee outperformed, and by a narrow margin, the best individual
classifier (for every attempted classifier combination ACC gave better results than
MV and DCS). This seems discouraging, especially in the light of the fact that
the committee approach is computationally expensive (its cost trivially amounts
to the sum of the costs of the individual classifiers plus the cost incurred for the
computation of the combination rule). Again, it has to be remarked that the small
size of their experiment (two test sets of less than 700 documents each were used)
does not allow one to draw definitive conclusions on the approaches adopted.



6.11.1 Boosting. The boosting method [Schapire et al. 1998; Schapire and Singer
2000] occupies a special place in the classifier committees literature, since the k
classifiers Φ_1, ..., Φ_k forming the committee are obtained by the same learning
method (here called the weak learner). The key intuition of boosting is that the
k classifiers should be trained not in a conceptually parallel and independent way,
as in the committees described above, but sequentially. In this way, in training
classifier Φ_t one may take into account how classifiers Φ_1, ..., Φ_{t-1} perform on
the training examples, and concentrate on getting right those examples on which
Φ_1, ..., Φ_{t-1} have performed worst.

Specifically, for learning classifier Φ_t each (d_j, c_i) pair is given an "importance
weight" h^t_ij (where h^1_ij is set to be equal for all (d_j, c_i) pairs; see footnote 15), which
represents how hard to get a correct decision for this pair was for classifiers Φ_1, ..., Φ_{t-1}. These
weights are exploited in learning Φ_t, which will be specially tuned to correctly solve
the pairs with higher weight. Classifier Φ_t is then applied to the training documents,
and as a result weights h^t_ij are updated to h^{t+1}_ij; in this update operation, pairs
correctly classified by Φ_t will have their weight decreased, while pairs misclassified
by Φ_t will have their weight increased. After all the k classifiers have been built, a
weighted linear combination rule is applied to yield the final committee.
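
The sequential reweighting scheme can be sketched with the textbook two-class AdaBoost update. This is an assumption: the text above only states that weights go down for correctly classified pairs and up for misclassified ones, so the exponential update, the committee weights and the helper names (boost, stump_learner) are illustrative choices, with a one-term decision stump standing in for the weak learner.

```python
import math


def stump_learner(examples, labels, h):
    """Weak learner: pick the term and polarity with the lowest weighted error."""
    best = None
    for k in range(len(examples[0])):
        for polarity in (True, False):
            err = sum(w for x, y, w in zip(examples, labels, h)
                      if ((x[k] > 0) == polarity) != y)
            if best is None or err < best[0]:
                best = (err, k, polarity)
    _, k, polarity = best
    return lambda x: (x[k] > 0) == polarity


def boost(examples, labels, weak_learner, rounds):
    n = len(examples)
    h = [1.0 / n] * n  # importance weights, initially equal
    committee = []
    for _ in range(rounds):
        clf = weak_learner(examples, labels, h)
        wrong = [clf(x) != y for x, y in zip(examples, labels)]
        err = sum(w for w, bad in zip(h, wrong) if bad)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        a = 0.5 * math.log((1 - err) / err)    # weight of clf in the committee
        # Misclassified examples gain weight, correctly classified ones lose it.
        h = [w * math.exp(a if bad else -a) for w, bad in zip(h, wrong)]
        total = sum(h)
        h = [w / total for w in h]
        committee.append((a, clf))

    def final(x):
        # Weighted linear combination of the committee members' votes.
        return sum(a * (1 if clf(x) else -1) for a, clf in committee) > 0

    return final, h


examples = [[1], [1], [1], [0]]
labels = [True, True, False, False]
_, weights = boost(examples, labels, stump_learner, rounds=1)
print(weights)  # the misclassified third example now carries about half the weight
```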



In the BoosTexter system [Schapire and Singer 2000], two different boosting
algorithms are tested, using a one-level decision tree weak learner. The former
algorithm (AdaBoost.MH, simply called AdaBoost in [Schapire et al. 1998]) is
explicitly geared towards the maximization of microaveraged effectiveness, whereas
the latter (AdaBoost.MR) is aimed at minimizing ranking loss (i.e. at getting a
correct category ranking for each individual document). In experiments conducted
over three different test collections, Schapire et al. [1998] have shown AdaBoost
to outperform Sleeping Experts, a classifier that had proven quite effective in
the experiments of [Cohen and Singer 1999]. Further experiments by Schapire and
Singer [2000] showed AdaBoost to outperform, aside from Sleeping Experts, a
Naive Bayes classifier, a standard (non-enhanced) Rocchio classifier, and Joachims'
[1997] PrTFIDF classifier.

15 Schapire et al. [1998] also show that a simple modification of this policy allows optimization of
the classifier based on "utility" (see Section 7.1.3).

A boosting algorithm based on a "committee of classifier sub-committees" that
improves on the effectiveness and (especially) the efficiency of AdaBoost.MH
is presented in [Sebastiani et al. 2000]. An approach similar to boosting is also
employed by Weiss et al. [1999], who experiment with committees of decision trees
each having an average of 16 leaves (hence much more complex than the simple
2-leaf trees used in [Schapire and Singer 2000]), eventually combined by using
the simple MV rule as a combination rule; similarly to boosting, a mechanism for
emphasising documents that have been misclassified by previous decision trees is
used. Boosting-based approaches have also been employed in [Escudero et al. 2000;
Iyer et al. 2000; Kim et al. 2000; Li and Jain 1998; Myers et al. 2000].



6.12 Other methods 

Although in the previous sections we have tried to give an overview as complete 
as possible of the learning approaches proposed in the TC literature, it would be 
hardly possible to be exhaustive. Some of the learning approaches adopted do 
not fall squarely under one or the other class of algorithms, or have remained 
somehow isolated attempts. Among these, the most noteworthy are the ones based
on Bayesian inference networks [Dumais et al. 1998; Lam et al. 1997; Tzeras and
Hartmann 1993], genetic algorithms [Clack et al. 1997; Masand 1994], and maximum
entropy modelling [Manning and Schütze 1999].



7. EVALUATION OF TEXT CLASSIFIERS 

As for text search systems, the evaluation of document classifiers is typically con- 
ducted experimentally, rather than analytically. The reason is that, in order to eval- 
uate a system analytically (e.g. proving that the system is correct and complete) 
we would need a formal specification of the problem that the system is trying to 
solve (e.g. with respect to what correctness and completeness are defined), and the 
central notion of TC (namely, that of membership of a document in a category) is, 
due to its subjective character, inherently non-formalisable. 

The experimental evaluation of a classifier usually measures its effectiveness 
(rather than its efficiency), i.e. its ability to take the right classification decisions. 

7.1 Measures of text categorization effectiveness 

7.1.1 Precision and recall. Classification effectiveness is usually measured in terms
of the classic IR notions of precision (π) and recall (ρ), adapted to the case of
TC. Precision wrt c_i (π_i) is defined as the conditional probability $P(\breve{\Phi}(d_x, c_i) = T \mid \Phi(d_x, c_i) = T)$,
i.e. as the probability that if a random document d_x is classified
under c_i, this decision is correct. Analogously, recall wrt c_i (ρ_i) is defined as
$P(\Phi(d_x, c_i) = T \mid \breve{\Phi}(d_x, c_i) = T)$, i.e. as the probability that, if a random document
d_x ought to be classified under c_i, this decision is taken. These category-relative
values may be averaged, in a way to be discussed shortly, to obtain π and ρ, i.e.
values global to the entire category set. Borrowing terminology from logic, π may
be viewed as the "degree of soundness" of the classifier wrt C, while ρ may be
viewed as its "degree of completeness" wrt C. As defined here, π_i and ρ_i are to be
understood as subjective probabilities, i.e. as measuring the expectation of the user
that the system will behave correctly when classifying an unseen document under
c_i. These probabilities may be estimated in terms of the contingency table for c_i
on a given test set (see Table 2). Here, FP_i (false positives wrt c_i, aka errors of
commission) is the number of test documents incorrectly classified under c_i; TN_i
(true negatives wrt c_i), TP_i (true positives wrt c_i) and FN_i (false negatives wrt c_i,
aka errors of omission) are defined accordingly. Estimates (indicated by carets) of
precision wrt c_i and recall wrt c_i may thus be obtained as

$$ \hat{\pi}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \hat{\rho}_i = \frac{TP_i}{TP_i + FN_i} $$

  Category c_i                      expert judgments
                                    YES        NO
  classifier judgments     YES      TP_i       FP_i
                           NO       FN_i       TN_i

Table 2. The contingency table for category c_i.

  Category set C = {c_1, ..., c_|C|}           expert judgments
                                    YES                            NO
  classifier judgments     YES      TP = sum_{i=1..|C|} TP_i       FP = sum_{i=1..|C|} FP_i
                           NO       FN = sum_{i=1..|C|} FN_i       TN = sum_{i=1..|C|} TN_i

Table 3. The global contingency table.

For obtaining estimates of π and ρ, two different methods may be adopted:

- microaveraging: π and ρ are obtained by summing over all individual decisions:

$$ \hat{\pi}^{\mu} = \frac{TP}{TP + FP} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)} $$

$$ \hat{\rho}^{\mu} = \frac{TP}{TP + FN} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)} $$

  where "μ" indicates microaveraging. The "global" contingency table (Table 3) is
  thus obtained by summing over category-specific contingency tables.
- macroaveraging: precision and recall are first evaluated "locally" for each category,
  and then "globally" by averaging over the results of the different categories:

$$ \hat{\pi}^{M} = \frac{\sum_{i=1}^{|C|} \hat{\pi}_i}{|C|}, \qquad \hat{\rho}^{M} = \frac{\sum_{i=1}^{|C|} \hat{\rho}_i}{|C|} $$

  where "M" indicates macroaveraging.

These two methods may give quite different results, especially if the different
categories have very different generality. For instance, the ability of a classifier to
behave well also on categories with low generality (i.e. categories with few positive
training instances) will be emphasized by macroaveraging and much less so by
microaveraging. Whether one or the other should be used obviously depends on
the application requirements. From now on, we will assume that microaveraging is
used; everything we will say in the rest of Section 7 may be adapted to the case of
macroaveraging in the obvious way.
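
The difference between the two averaging methods can be seen on a small worked example. The sketch below assumes each category is summarized by a (TP, FP, FN) triple and, as a convention not taken from the source, scores an undefined ratio as 0.0; the counts are invented for illustration.

```python
def precision_recall(tp, fp, fn):
    """Estimates of precision and recall from one contingency table."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def micro_average(tables):
    """Sum the per-category counts first, then compute the two ratios."""
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    fn = sum(t[2] for t in tables)
    return precision_recall(tp, fp, fn)


def macro_average(tables):
    """Compute the ratios per category first, then average them."""
    per_category = [precision_recall(*t) for t in tables]
    n = len(per_category)
    return (sum(p for p, _ in per_category) / n,
            sum(r for _, r in per_category) / n)


# One frequent category handled well, one rare category handled poorly.
tables = [(90, 10, 10), (1, 1, 3)]
print(micro_average(tables))  # roughly (0.892, 0.875)
print(macro_average(tables))  # roughly (0.7, 0.575)
```

The rare category barely affects the microaveraged figures, whereas it pulls the macroaveraged ones down sharply.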

7.1.2 Other measures of effectiveness. Measures alternative to π and ρ and commonly
used in the ML literature, such as accuracy (estimated as
$\hat{A} = \frac{TP + TN}{TP + TN + FP + FN}$) and error (estimated as
$\hat{E} = \frac{FP + FN}{TP + TN + FP + FN} = 1 - \hat{A}$), are not widely used in TC.
The reason is that, as Yang [1999] points out, the large value that their denominator
typically has in TC makes them much more insensitive to variations in the
number of correct decisions (TP + TN) than π and ρ. Besides, if A is the adopted
evaluation measure, in the frequent case of a very low average generality the trivial
rejector (i.e. the classifier Φ such that Φ(d_j, c_i) = F for all d_j and c_i) tends to
outperform all non-trivial classifiers (see also [Cohen 1995a, Section 2.3]). If A is
adopted, parameter tuning on a validation set may thus result in parameter choices
that make the classifier behave very much like the trivial rejector.
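
A small numeric illustration of the trivial-rejector effect, with invented counts (1000 decisions, of which only 10 should be positive):

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def error(tp, fp, fn, tn):
    return (fp + fn) / (tp + fp + fn + tn)


# Trivial rejector: every decision is negative, so all 10 positives are missed.
rejector = dict(tp=0, fp=0, fn=10, tn=990)
# A non-trivial classifier that finds 8 of the 10 positives, with 15 false alarms.
nontrivial = dict(tp=8, fp=15, fn=2, tn=975)

print(accuracy(**rejector))    # 0.99
print(accuracy(**nontrivial))  # 0.983
```

Judged by accuracy alone the rejector looks better, even though its recall is 0 while the second classifier reaches a recall of 0.8.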

A non-standard effectiveness measure is proposed by Sable and Hatzivassiloglou
[2000, Section 7], who suggest basing π and ρ not on "absolute" values of success
and failure (i.e. 1 if $\Phi(d_j, c_i) = \breve{\Phi}(d_j, c_i)$ and 0 if $\Phi(d_j, c_i) \neq \breve{\Phi}(d_j, c_i)$), but on values
of relative success (i.e. CSV_i(d_j) if $\breve{\Phi}(d_j, c_i) = T$ and 1 - CSV_i(d_j) if $\breve{\Phi}(d_j, c_i) = F$).
This means that for a correct (resp. wrong) decision the classifier is rewarded
(resp. penalized) proportionally to its confidence in the decision. This proposed
measure does not reward the choice of a good thresholding policy, and is thus unfit
for autonomous ("hard") classification systems. However, it might be appropriate
for interactive ("ranking") classifiers of the type used in [Larkey 1999], where the
confidence that the classifier has in its own decision influences category ranking
and, as a consequence, the overall usefulness of the system.

7.1.3 Measures alternative to effectiveness. In general, criteria different from ef- 
fectiveness are seldom used in classifier evaluation. For instance, efficiency, al- 
though very important for applicative purposes, is seldom used as the sole yardstick, 
due to the volatility of the parameters on which the evaluation rests. However, 
efficiency may be useful for choosing among classifiers with similar effectiveness. 
An interesting evaluation has been carried out by Dumais et al. [1998], who have
compared five different learning methods along three different dimensions, namely
effectiveness, training efficiency (i.e. the average time it takes to build a classifier
for category $c_i$ from a training set $Tr$), and classification efficiency (i.e. the average
time it takes to classify a new document $d_j$ under category $c_i$).

An important alternative to effectiveness is utility, a class of measures from decision
theory that extend effectiveness by economic criteria such as gain or loss.
Utility is based on a utility matrix such as that of Table 4, where the numeric values
$u_{TP}$, $u_{FP}$, $u_{FN}$ and $u_{TN}$ represent the gain brought about by a true positive,
false positive, false negative and true negative, respectively; both $u_{TP}$ and $u_{TN}$ are
greater than both $u_{FP}$ and $u_{FN}$. "Standard" effectiveness is a special case of utility,
i.e. the one in which $u_{TP} = u_{TN} > u_{FP} = u_{FN}$. Less trivial cases are those in
which $u_{TP} \neq u_{TN}$ and/or $u_{FP} \neq u_{FN}$; this is the case e.g. in spam filtering, where
failing to discard a piece of junk mail (FP) is a less serious mistake than discarding
a legitimate message (FN) [Androutsopoulos et al. 2000]. If the classifier outputs
probability estimates of the membership of $d_j$ in $c_i$, then decision theory provides
analytical methods to determine thresholds $\tau_i$, thus avoiding the need to determine
them experimentally (as discussed in Section 6.1). Specifically, as Lewis [1995a]
reminds us, the expected value of utility is maximized when

$$\tau_i = \frac{u_{FP} - u_{TN}}{(u_{FN} - u_{TP}) + (u_{FP} - u_{TN})}$$

which, in the case of "standard" effectiveness, is equal to $\frac{1}{2}$. Works where utility is
employed are [Amati and Crestani 1999; Cohen and Singer 1999; Hull et al. 1996;
Lewis and Catlett 1994; Schapire et al. 1998]. Utility has become
popular within the text filtering community, and the TREC "filtering track" evaluations
have long been using it [Lewis 1995c]. The values of the utility matrix
are extremely application-dependent. This means that if utility is used instead of
"pure" effectiveness, there is a further element of difficulty in the cross-comparison
of classification systems (see Section 7), since for two classifiers to be experimentally
comparable also the two utility matrices must be the same.

                                          expert judgments
Category set C = {c_1, ..., c_|C|}        YES           NO
classifier judgments       YES            $u_{TP}$      $u_{FP}$
                           NO             $u_{FN}$      $u_{TN}$

Table 4. The utility matrix.
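The analytical threshold can be checked numerically with a short sketch. The function names
and the spam-like utility values below are illustrative assumptions; only the threshold
formula itself comes from the text. The positive class is taken to be "legitimate message",
in line with the spam filtering example above.

```python
def utility_threshold(u_tp: float, u_fp: float, u_fn: float, u_tn: float) -> float:
    """Probability above which accepting d_j under c_i has the higher expected utility."""
    return (u_fp - u_tn) / ((u_fn - u_tp) + (u_fp - u_tn))


def expected_utilities(p: float, u_tp: float, u_fp: float, u_fn: float, u_tn: float):
    """Expected utility of accepting and of rejecting when P(d_j belongs to c_i) = p."""
    accept = p * u_tp + (1 - p) * u_fp
    reject = p * u_fn + (1 - p) * u_tn
    return accept, reject


# "Standard" effectiveness: u_TP = u_TN > u_FP = u_FN.
standard = utility_threshold(u_tp=1, u_fp=0, u_fn=0, u_tn=1)

# Hypothetical spam-like matrix: losing a legitimate message (FN) costs far more
# than letting a junk message through (FP).
spam = dict(u_tp=1, u_fp=-1, u_fn=-9, u_tn=1)
tau = utility_threshold(**spam)

print(f"standard threshold: {standard}")
print(f"spam-like threshold: {tau:.4f}")
for p in (0.1, 0.3):
    accept, reject = expected_utilities(p, **spam)
    print(f"p={p}: accept={accept:.2f} reject={reject:.2f}")
```

With the assumed matrix the threshold drops well below one half, so a message is kept even
when the estimated probability that it is legitimate is fairly low.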

Other effectiveness measures different from the ones discussed here have occasionally
been used in the literature; these include adjacent score [Larkey 1998],
coverage [Schapire and Singer 2000], one-error [Schapire and Singer 2000], Pearson
product-moment correlation [Larkey 1998], recall at n [Larkey and Croft 1996], top
candidate [Larkey and Croft 1996], top n [Larkey and Croft 1996]. We will not
attempt to discuss them in detail. However, their use shows that, although the TC
community is making consistent efforts at standardising experimentation protocols,
we are still far from universal agreement on evaluation issues and, as a consequence,
from understanding precisely the relative merits of the various methods.

7.1.4 Combined effectiveness measures. Neither precision nor recall makes sense
in isolation from the other. In fact the classifier $\Phi$ such that $\Phi(d_j, c_i) = T$ for all
$d_j$ and $c_i$ (the trivial acceptor) has $\rho = 1$. When the $CSV_i$ function has values in
$[0, 1]$ one only needs to set every threshold $\tau_i$ to 0 to obtain the trivial acceptor. In
this case $\pi$ would usually be very low (more precisely, equal to the average test set
generality $\frac{1}{|C|} \sum_{i=1}^{|C|} g_{Te}(c_i)$)^16. Conversely, it is well-known from everyday
IR practice that higher levels of $\pi$ may be obtained at the price of low values of $\rho$.

In practice, by tuning $\tau_i$ a function $CSV_i : D \rightarrow \{T, F\}$ is tuned to be, in the
words of Riloff and Lehnert [1994], more liberal (i.e. improving $\rho_i$ to the detriment
of $\pi_i$) or more conservative (improving $\pi_i$ to the detriment of $\rho_i$)^17. A classifier
should thus be evaluated by means of a measure which combines $\pi$ and $\rho$^18. Various
such measures have been proposed, among which the most frequent are:

(1) 11-point average precision: threshold $\tau_i$ is repeatedly tuned so as to allow $\rho_i$
to take up values of 0.0, .1, ..., .9, 1.0; $\pi_i$ is computed for these 11 different
values of $\tau_i$, and averaged over the 11 resulting values. This is analogous to the
standard evaluation methodology for ranked IR systems, and may be used
(a) with categories in place of IR queries. This is most frequently used for
document-ranking classifiers (see e.g. [Schütze et al. 1995; Yang 1994; Yang
1999; Yang and Pedersen 1997]);
(b) with test documents in place of IR queries and categories in place of
documents. This is most frequently used for category-ranking classifiers (see
e.g. [Lam et al. 1999; Larkey and Croft 1996; Schapire and Singer 2000;
Wiener et al. 1995]). In this case if macroaveraging is used it needs to be
redefined on a per-document, rather than per-category basis.
This measure does not make sense for binary-valued $CSV_i$ functions, since in
this case $\rho_i$ may not be varied at will.

(2) the breakeven point, i.e. the value at which $\pi$ equals $\rho$ (e.g. [Apte et al. 1994;
Cohen and Singer 1999; Dagan et al. 1997; Joachims 1998; Joachims 1999;
Lewis 1992a; Lewis and Ringuette 1994; Moulinier and Ganascia 1996; Ng
et al. 1997; Yang 1999]). This is obtained by a process analogous to the one
used for 11-point average precision: a plot of $\pi$ as a function of $\rho$ is computed
by repeatedly varying the thresholds $\tau_i$; breakeven is the value of $\rho$ (or $\pi$) for
which the plot intersects the $\rho = \pi$ line. This idea relies on the fact that by
decreasing the $\tau_i$'s from 1 to 0, $\rho$ always increases monotonically from 0 to 1
and $\pi$ usually decreases monotonically from a value near 1 to
$\frac{1}{|C|} \sum_{i=1}^{|C|} g_{Te}(c_i)$.
If for no values of the $\tau_i$'s $\pi$ and $\rho$ are exactly equal, the $\tau_i$'s are set to the
value for which $\pi$ and $\rho$ are closest, and an interpolated breakeven is computed
as the average of the values of $\pi$ and $\rho$^19.

(3) the $F_\beta$ function [van Rijsbergen 1979, Chapter 7], for some $0 \leq \beta \leq +\infty$ (e.g.
[Cohen 1995a; Cohen and Singer 1999; Lewis and Gale 1994; Lewis 1995a;
Moulinier et al. 1996; Ruiz and Srinivasan 1999]), where

$$F_\beta = \frac{(\beta^2 + 1) \pi \rho}{\beta^2 \pi + \rho}$$

Here $\beta$ may be seen as the relative degree of importance attributed to $\pi$ and $\rho$.
If $\beta = 0$ then $F_\beta$ coincides with $\pi$, whereas if $\beta = +\infty$ then $F_\beta$ coincides with
$\rho$. Usually, a value $\beta = 1$ is used, which attributes equal importance to $\pi$ and
$\rho$. As shown in [Moulinier et al. 1996; Yang 1999], the breakeven of a classifier
is always less than or equal to its $F_1$ value.

                                     Precision     Recall        C-precision   C-recall
                                     TP/(TP+FP)    TP/(TP+FN)    TN/(FP+TN)    TN/(TN+FN)
Trivial Rejector (TP=FP=0)           undefined     0/FN = 0      TN/TN = 1     TN/(TN+FN)
Trivial Acceptor (FN=TN=0)           TP/(TP+FP)    TP/TP = 1     0/FP = 0      undefined
Trivial "Yes" Collection (FP=TN=0)   TP/TP = 1     TP/(TP+FN)    undefined     0/FN = 0
Trivial "No" Collection (TP=FN=0)    0/FP = 0      undefined     TN/(FP+TN)    TN/TN = 1

Table 5. Trivial cases in TC.

16 From this one might be tempted to infer, by symmetry, that the trivial rejector always has
$\pi = 1$. This is false, as $\pi$ is undefined (the denominator is zero) for the trivial rejector (see
Table 5). In fact, it is clear from its definition ($\pi = \frac{TP}{TP+FP}$) that $\pi$ depends only on how
the positives ($TP + FP$) are split between the true positives $TP$ and the false positives $FP$, and
does not depend at all on the cardinality of the positives. There is a breakup of "symmetry"
between $\pi$ and $\rho$ here because, from the point of view of classifier judgment (positives vs.
negatives; this is the dichotomy of interest in trivial acceptor vs. trivial rejector) the "symmetric"
of $\rho$ ($\frac{TP}{TP+FN}$) is not $\pi$ ($\frac{TP}{TP+FP}$) but c-precision ($\pi^c = \frac{TN}{FP+TN}$), the
"contrapositive" of $\pi$. In fact, while $\rho = 1$ and $\pi^c = 0$ for the trivial acceptor, $\pi^c = 1$ and
$\rho = 0$ for the trivial rejector.

17 While $\rho_i$ can always be increased at will by lowering $\tau_i$, usually at the cost of decreasing $\pi_i$,
$\pi_i$ can usually be increased at will by raising $\tau_i$, always at the cost of decreasing $\rho_i$. This kind of
tuning is only possible for $CSV_i$ functions with values in $[0, 1]$; for binary-valued $CSV_i$ functions
tuning is not always possible, or is anyway more difficult (see e.g. [Weiss et al. 1999, page 66]).

18 An exception is single-label TC, in which $\pi$ and $\rho$ are not independent of each other: if a
document $d_j$ has been classified under a wrong category $c_s$ (thus decreasing $\pi_s$) this also means
that it has not been classified under the right category $c_t$ (thus decreasing $\rho_t$). In this case either
$\pi$ or $\rho$ can be used as a measure of effectiveness.
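The three combined measures can be sketched for a single category as follows. This is a
minimal illustration, not a reference implementation: the scores and labels are invented, the
11-point figure uses interpolated precision (the best precision at any recall at or above
each level, as is common for ranked IR evaluation), and precision for an empty accepted set
is taken to be 1 by convention. All three choices are assumptions rather than prescriptions
from the text.

```python
from typing import List, Sequence, Tuple


def precision_recall(scores: Sequence[float], labels: Sequence[int],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when d_j is accepted iff its CSV score is >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pr_curve(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """(precision, recall) pairs obtained by lowering the threshold step by step."""
    return [precision_recall(scores, labels, t)
            for t in sorted(set(scores), reverse=True)]


def eleven_point_average_precision(scores, labels) -> float:
    curve = pr_curve(scores, labels)
    total = 0.0
    for k in range(11):
        level = k / 10
        # Interpolated precision at this recall level.
        reachable = [p for p, r in curve if r >= level - 1e-12]
        total += max(reachable) if reachable else 0.0
    return total / 11


def breakeven(scores, labels) -> float:
    """Interpolated breakeven: mean of precision and recall where they are closest."""
    p, r = min(pr_curve(scores, labels), key=lambda pr: abs(pr[0] - pr[1]))
    return (p + r) / 2


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    b2 = beta * beta
    denom = b2 * precision + recall
    return (b2 + 1) * precision * recall / denom if denom else 0.0


# Hypothetical CSV scores and expert judgments for six test documents.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(eleven_point_average_precision(scores, labels))
print(breakeven(scores, labels))
print(f_beta(*precision_recall(scores, labels, 0.6)))
```

Setting `beta` to 0 returns precision and a very large `beta` approaches recall, matching the
limiting cases described in item (3).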

Once an effectiveness measure is chosen, a classifier can be tuned (e.g. thresholds
and other parameters can be set) so that the resulting effectiveness is the best
achievable by that classifier. Tuning a parameter $p$ (be it a threshold or other)
is normally done experimentally. This means performing repeated experiments on
the validation set with the values of the other parameters $p_k$ fixed (at a default
value, in the case of a yet-to-be-tuned parameter $p_k$, or at the chosen value, if the
parameter $p_k$ has already been tuned) and with different values for parameter $p$.
The value that has yielded the best effectiveness is chosen for $p$.
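The one-parameter-at-a-time procedure described above can be sketched as a simple loop. The
`evaluate` function below is a hypothetical stand-in for "train the classifier with these
parameter values and measure its effectiveness on the validation set"; the parameter names,
default values and candidate grids are likewise invented for illustration.

```python
from typing import Callable, Dict, List


def tune_parameters(evaluate: Callable[[Dict[str, float]], float],
                    grids: Dict[str, List[float]],
                    defaults: Dict[str, float]) -> Dict[str, float]:
    """Tune each parameter in turn, keeping the others fixed.

    Parameters not yet tuned stay at their default value; parameters already
    tuned stay at their chosen value.
    """
    chosen = dict(defaults)
    for name, candidates in grids.items():
        best_value, best_score = chosen[name], float("-inf")
        for value in candidates:
            trial = dict(chosen)
            trial[name] = value
            score = evaluate(trial)  # effectiveness on the validation set
            if score > best_score:
                best_value, best_score = value, score
        chosen[name] = best_value
    return chosen


def evaluate(params: Dict[str, float]) -> float:
    # Toy effectiveness surface with a known optimum at threshold=0.3, k=30.
    return 1.0 - (params["threshold"] - 0.3) ** 2 - ((params["k"] - 30) / 100) ** 2


tuned = tune_parameters(
    evaluate,
    grids={"threshold": [0.1, 0.2, 0.3, 0.4, 0.5], "k": [10, 30, 50]},
    defaults={"threshold": 0.5, "k": 10},
)
print(tuned)
```

Because each parameter is tuned with the others held fixed, the result may depend on the
tuning order when parameters interact; the toy surface used here is separable, so the order
does not matter.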

19 Breakeven, first proposed by Lewis [1992a, 1992b], has been recently criticized. Lewis himself
(see his message of 11 Sep 1997 10:49:01 to the DDLBETA text categorization mailing list - quoted
with permission of the author) points out that breakeven is not a good effectiveness measure, since
(i) there may be no parameter setting that yields the breakeven; in this case the final breakeven
value, obtained by interpolation, is artificial; (ii) to have $\rho$ equal $\pi$ is not necessarily desirable,
and it is not clear that a system that achieves high breakeven can be tuned to score high on other
effectiveness measures. Yang [1999] also notes that when for no value of the parameters $\pi$ and $\rho$
are close enough, interpolated breakeven may not be a reliable indicator of effectiveness.

7.2 Benchmarks for text categorization

Standard benchmark collections that can be used as initial corpora for TC are
publicly available for experimental purposes. The most widely used is the Reuters
collection, consisting of a set of newswire stories classified under categories related
to economics. The Reuters collection accounts for most of the experimental work in
TC so far. Unfortunately, this does not always translate into reliable comparative
results, in the sense that many of these experiments have been carried out in subtly
different conditions.

In general, different sets of experiments may be used for cross-classifier compar- 
ison only if the experiments have been performed 

(1) on exactly the same collection (i.e. same documents and same categories); 

(2) with the same "split" between training set and test set; 

(3) with the same evaluation measure and, whenever this measure depends on some 
parameters (e.g. the utility matrix chosen), with the same parameter values. 

Unfortunately, a lot of experimentation, both on Reuters and on other collections,
has not been performed with these caveats in mind: by testing three different
classifiers on five popular versions of Reuters, Yang [1999] has shown that a lack of
compliance with these three conditions may make the experimental results hardly
comparable among each other. Table 6 lists the results of all experiments known
to us performed on five major versions of the Reuters benchmark: Reuters-22173
"ModLewis" (column #1), Reuters-22173 "ModApte" (column #2), Reuters-22173
"ModWiener" (column #3), Reuters-21578 "ModApte" (column #4) and
Reuters-21578[10] "ModApte" (column #5)^20. Only experiments that have computed either
a breakeven or $F_1$ have been listed, since other less popular effectiveness measures
do not readily compare with these.

Note that only results belonging to the same column are directly comparable.
In particular, Yang [1999] showed that experiments carried out on Reuters-22173
"ModLewis" (column #1) are not directly comparable with those using the other
three versions, since the former strangely includes a significant percentage (58%)
of "unlabelled" test documents which, being negative examples of all categories,
tend to depress effectiveness. Also, experiments performed on Reuters-21578[10]
"ModApte" (column #5) are not comparable with the others, since this collection
is the restriction of Reuters-21578 "ModApte" to the 10 categories with the highest
generality, and is thus an obviously "easier" collection.

Other test collections that have been frequently used are

—the OHSUMED collection, set up by Hersh et al. [1994] and used in [Joachims
1998; Lam and Ho 1998; Lam et al. 1999; Lewis et al. 1996; Ruiz and Srinivasan
1999; Yang and Pedersen 1997]^21. The documents are titles or title-plus-abstracts
from medical journals (OHSUMED is actually a subset of the Medline
document base); the categories are the "postable terms" of the MESH thesaurus.

—the 20 Newsgroups collection, set up by Lang [1995] and used in [Baker and
McCallum 1998; Joachims 1997; McCallum and Nigam 1998; McCallum et al.
1998; Nigam et al. 2000; Schapire and Singer 2000]. The documents are messages
posted to Usenet newsgroups, and the categories are the newsgroups themselves.



20 The Reuters-21578 collection may be freely downloaded for experimentation purposes from
http://www.research.att.com/~lewis/reuters21578.html and is now considered the "standard"
variant of Reuters. We do not cover experiments performed on variants of Reuters different from the
five listed because the small number of authors that have used the same variant makes the reported
results difficult to interpret. This includes experiments performed on the original Reuters-22173
"ModHayes" [Hayes et al. 1990] and Reuters-21578 "ModLewis" [Cohen and Singer 1999].

21 The OHSUMED collection may be freely downloaded for experimentation purposes from
ftp://medir.ohsu.edu/pub/ohsumed
                               #1        #2        #3        #4        #5
# of documents             21,450    14,347    13,272    12,902    12,902
# of training documents    14,704    10,667     9,610     9,603     9,603
# of test documents         6,746     3,680     3,662     3,299     3,299
# of categories               135        93        92        90        10

[Per-classifier result rows not recoverable from the source. Surviving row labels: C4.5, IND,
Swap-1, Ripper, Sleeping Experts, DL-ESC, Charade, LLSF, Balanced Winnow, Widrow-Hoff,
Rocchio, FindSim, Classi, NNet, Gis-W, k-NN, SvmLight, SVM, AdaBoost.MH.]

Table 6. Comparative results among different classifiers obtained on five different versions of
Reuters. Unless otherwise noted, entries indicate the microaveraged breakeven point; within
parentheses, "M" indicates macroaveraging and "$F_1$" indicates use of the $F_1$ measure.
Boldface indicates the best performer on the collection.





—the AP collection, used in [Cohen 1995a; Cohen 1995b; Cohen and Singer 1999;
Lewis and Catlett 1994; Lewis and Gale 1994; Lewis et al. 1996; Schapire and
Singer 2000; Schapire et al. 1998].

We will not cover the experiments performed on these collections for the same
reasons as those illustrated in Footnote 20, i.e. because in no case a significant
enough number of authors have used the same collection in the same experimental
conditions, thus making comparisons difficult.



7.3 Which text classifier is best? 

The published experimental results, and especially those listed in Table 6, allow us
to attempt some considerations on the comparative performance of the TC methods
discussed. However, we have to bear in mind that comparisons are reliable only
when based on experiments performed by the same author under carefully controlled
conditions. They are instead more problematic when they involve different
experiments performed by different authors. In this case various "background
conditions", often extraneous to the learning algorithm itself, may influence the results.
These may include, among others, different choices in pre-processing (stemming,
etc.), indexing, dimensionality reduction, classifier parameter values, etc., but also
different standards of compliance with safe scientific practice (such as tuning
parameters on the test set rather than on a separate validation set), which often are
not discussed in the published papers.



Two different methods may thus be applied for comparing classifiers [Yang 1999]:

— direct comparison: classifiers $\Phi'$ and $\Phi''$ may be compared when they have been
tested on the same collection $\Omega$, usually by the same researchers and with the
same background conditions. This is the more reliable method.

— indirect comparison: classifiers $\Phi'$ and $\Phi''$ may be compared when

(1) they have been tested on collections $\Omega'$ and $\Omega''$, respectively, typically by
different researchers and hence with possibly different background conditions;

(2) one or more "baseline" classifiers $\Phi_1, \ldots, \Phi_m$ have been tested on both $\Omega'$
and $\Omega''$ by the direct comparison method.

Test 2 gives an indication on the relative "hardness" of $\Omega'$ and $\Omega''$; using this and
the results from Test 1 we may obtain an indication on the relative effectiveness
of $\Phi'$ and $\Phi''$. For the reasons discussed above, this method is less reliable.

A number of interesting conclusions can be drawn from Table 6 by using these two
methods. Concerning the relative "hardness" of the five collections, if by $\Omega' > \Omega''$
we indicate that $\Omega'$ is a harder collection than $\Omega''$, there seems to be enough
evidence that Reuters-22173 "ModLewis" > Reuters-22173 "ModWiener" >
Reuters-22173 "ModApte" ≈ Reuters-21578 "ModApte" > Reuters-21578[10] "ModApte".
These facts are unsurprising; in particular, the first and the last inequalities are a
direct consequence of the peculiar characteristics of Reuters-22173 "ModLewis" and
Reuters-21578[10] "ModApte" discussed in Section 7.2.

Concerning the relative performance of the classifiers, remembering the consid- 
erations above we may attempt a few conclusions: 

— Boosting-based classifier committees, support vector machines, example-based
methods, and regression methods deliver top-notch performance. There seems to
be no sufficient evidence to decidedly opt for any one of these methods; efficiency
considerations or application-dependent issues might play a role in breaking the tie.

— Neural networks and on-line linear classifiers work very well, although slightly
worse than the previously mentioned methods.

— Batch linear classifiers (Rocchio) and probabilistic Naive Bayes classifiers look
the worst of the learning-based classifiers. For Rocchio, these results confirm
earlier results by Schütze et al. [1995], who had found three classifiers based on
linear discriminant analysis, linear regression, and neural networks, to perform
about 15% better than Rocchio. However, recent results by Schapire et al. [1998]
rank Rocchio among the best performers once near-positives are used in training.

— The data in Table 6 are hardly sufficient to say anything about decision trees.
However, the work by Dumais et al. [1998], in which a decision tree classifier was
shown to perform nearly as well as their top performing system (an SVM classifier),
will probably renew the interest in decision trees, an interest that had dwindled
after the unimpressive results reported in earlier literature [Cohen and Singer
1999; Joachims 1998; Lewis and Catlett 1994; Lewis and Ringuette 1994].

— By far the lowest performance is displayed by Word, a classifier implemented
by Yang [1999] and not including any learning component^22.



Concerning Word and no-learning classifiers, for completeness we should recall
that one of the highest effectiveness values reported in the literature for the Reuters
collection (a .90 breakeven) belongs to Construe, a manually constructed classifier.
However, this classifier has never been tested on the standard variants of
Reuters mentioned in Table 6, and it is not clear [Yang 1999] whether the (small)
test set of Reuters-22173 "ModHayes" on which the .90 breakeven value was
obtained was chosen randomly, as safe scientific practice would demand. Therefore,
the fact that this figure is indicative of the performance of Construe, and of the
manual approach it represents, has been convincingly questioned [Yang 1999].

It is important to bear in mind that the considerations above are not absolute
statements (if there may be any) on the comparative effectiveness of these TC
methods. One of the reasons is that a particular applicative context may exhibit
very different characteristics from the ones to be found in Reuters, and different
classifiers may respond differently to these characteristics. An experimental study
by Joachims [1998] involving support vector machines, k-NN, decision trees, Rocchio
and Naive Bayes, showed all these classifiers to have similar effectiveness on
categories with > 300 positive training examples each. The fact that this experiment
involved the methods which have scored best (support vector machines, k-NN)
and worst (Rocchio and Naive Bayes) according to Table 6 shows that applicative
contexts different from Reuters may well invalidate conclusions drawn on this latter.

Finally, a note about statistical significance testing is in order. Few authors have
gone to the trouble of validating their results by means of such tests. These tests
are useful for verifying how strongly the experimental results support the claim that
a given system $\Phi'$ is better than another system $\Phi''$, or for verifying how much a
difference in the experimental setup affects the measured effectiveness of a system
$\Phi$. Hull [1994] and Schütze et al. [1995] have been among the first to work in this
direction, validating their results by means of the Anova test and the Friedman
test; the former is aimed at determining the significance of the difference in
effectiveness between two methods in terms of the ratio between this difference and the
effectiveness variability across categories, while the latter conducts a similar test by
using instead the rank positions of each method within a category. Yang and Liu
[1999] define a full suite of significance tests, some of which apply to microaveraged
and some to macroaveraged effectiveness. They apply them systematically to the
comparison between five different classifiers, and are thus able to infer fine-grained
conclusions about their relative effectiveness. For other examples of significance
testing in TC see [Cohen 1995a; Cohen 1995b; Cohen and Hirsh 1998; Joachims
1997; Koller and Sahami 1997; Lewis et al. 1996; Wiener et al. 1995].
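The rank-based idea behind the Friedman test can be sketched with the standard library alone.
The sketch below computes the textbook Friedman statistic without a tie correction, which is a
simplifying assumption; the effectiveness table is invented, and the code is not claimed to
reproduce the exact procedure used in the cited studies.

```python
from typing import List, Sequence


def average_ranks(row: Sequence[float]) -> List[float]:
    """Rank the methods within one category: rank 1 is the most effective.

    Tied methods share the average of the rank positions they occupy.
    """
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def friedman_statistic(table: Sequence[Sequence[float]]) -> float:
    """table[i][j] is the effectiveness of method j on category i."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical per-category effectiveness of three methods on four categories.
table = [
    [0.90, 0.85, 0.70],
    [0.80, 0.75, 0.60],
    [0.95, 0.90, 0.85],
    [0.70, 0.65, 0.50],
]
q = friedman_statistic(table)
# Under the null hypothesis q is approximately chi-square with k - 1 = 2 degrees
# of freedom; 5.991 is the usual 5% critical value for 2 degrees of freedom.
print(q, q > 5.991)
```

With so few categories the chi-square approximation is rough, so a real study would use more
categories or exact tables.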



22 Word is based on the comparison between documents and category names, each treated as a
vector of weighted terms in the vector space model. Word was implemented by Yang with the
only purpose of determining the difference in effectiveness that adding a learning component to
a classifier brings about. Word is actually called STR in [Yang 1994; Yang and Chute 1994].
Another no-learning classifier is proposed in [Wong et al. 1996].
8. CONCLUSION

Automated TC is now a major research area within the information systems
discipline, thanks to a number of factors:

— Its domains of application are numerous and important, and given the proliferation
of documents in digital form they are bound to increase dramatically in
both number and importance.

— It is indispensable in many applications in which the sheer number of the
documents to be classified and the short response time required by the application
make the manual alternative implausible.

— It can improve the productivity of human classifiers in applications in which no
classification decision can be taken without a final human judgment [Larkey and
Croft 1996], by providing tools that quickly "suggest" plausible decisions.

— It has reached effectiveness levels comparable to those of trained professionals.
The effectiveness of manual TC is not 100% anyway [Cleverdon 1984] and, more
importantly, it is unlikely to be improved substantially by the progress of research.
The levels of effectiveness of automated TC are instead growing at a steady pace,
and even if they will likely reach a plateau well below the 100% level, this plateau
will probably be higher than the effectiveness levels of manual TC.

One of the reasons why the effectiveness of text classifiers has dramatically
improved since the early '90s is the arrival in the TC arena of ML methods that are
backed by strong theoretical motivations. Examples of these are multiplicative
weight updating (e.g. the Winnow family, Widrow-Hoff, etc.), adaptive resampling
(e.g. boosting) and support vector machines, which provide a sharp contrast
with relatively unsophisticated and weak methods such as Rocchio. In TC, ML
researchers have found a challenging application, since datasets consisting of
hundreds of thousands of documents and characterized by tens of thousands of terms
are widely available. This means that TC is a good benchmark for checking whether
a given learning technique can scale up to substantial sizes. In turn, this probably
means that the active involvement of the ML community in TC is bound to grow.

The success story of automated TC is also going to encourage an extension of its 
methods and techniques to neighbouring fields of application. Techniques typical 
of automated TC have already been extended successfully to the categorization of 
documents expressed in slightly different media; for instance: 



—very noisy text resulting from optical character recognition [Ittner et al. 1995;
Junker and Hoch 1998]. In their experiments Ittner et al. [1995] have found that,
by employing noisy texts also in the training phase (i.e. texts affected by the same
source of noise that is also at work in the test documents), effectiveness levels
comparable to those obtainable in the case of standard text can be achieved.

—speech transcripts [Myers et al. 2000; Schapire and Singer 2000]. For instance,
Schapire and Singer [2000] classify answers given to a phone operator's request
"How may I help you?", so as to be able to route the call to a specialized
operator according to call type.

Concerning other more radically different media, the situation is not as bright
(however, see [Lim 1999] for an interesting attempt at image categorization based on a
textual metaphor). The reason for this is that capturing real semantic content of
non-textual media by automatic indexing is still an open problem. While there are 
systems that attempt to detect content e.g. in images by recognising shapes, colour 
distributions and texture, the general problem of image semantics is still unsolved. 
The main reason is that natural language, the language of the text medium, ad- 
mits far fewer variations than the "languages" employed by the other media. For 
instance, while the concept of a house can be "triggered" by relatively few natural 
language expressions such as house, houses, home, housing, inhabiting, etc., it 
can be triggered by far more images: the images of all the different houses that 
exist, of all possible colours and shapes, viewed from all possible perspectives, from 
all possible distances, etc. If we had solved the multimedia indexing problem in 
a satisfactory way, the general methodology that we have discussed in this paper 
for text would also apply to automated multimedia categorization, and there are 
reasons to believe that the effectiveness levels could be as high. This only adds to 
the common sentiment that more research in automated content-based indexing for 
multimedia documents is needed. 